PREFACE


This book covers most topics needed to develop a broad and thorough working knowl- 
edge of modern computational statistics. We seek to develop a practical understanding 
of how and why existing methods work, enabling readers to use modern statistical 
methods effectively. Since many new methods are built from components of exist- 
ing techniques, our ultimate goal is to provide scientists with the tools they need to 
contribute new ideas to the field. 

A growing challenge in science is that there is so much of it. While the pursuit
of important new methods and the honing of existing approaches is a worthy goal, 
there is also a need to organize and distill the teeming jungle of ideas. We attempt to 
do that here. Our choice of topics reflects our view of what constitutes the core of the 
evolving field of computational statistics, and what will be interesting and useful for 
our readers. 

Our use of the adjective modern in the first sentence of this preface is potentially
troublesome: There is no way that this book can cover all the latest, greatest techniques.
We have not even tried. We have instead aimed to provide a reasonably up-to-date 
survey of a broad portion of the field, while leaving room for diversions and esoterica. 

The foundations of optimization and numerical integration are covered in this 
book. We include these venerable topics because (i) they are cornerstones of frequen- 
tist and Bayesian inference; (ii) routine application of available software often fails 
for hard problems; and (iii) the methods themselves are often secondary components 
of other statistical computing algorithms. Some topics we have omitted represent 
important areas of past and present research in the field, but their priority here is 
lowered by the availability of high-quality software. For example, the generation of 
pseudo-random numbers is a classic topic, but one that we prefer to address by giv- 
ing students reliable software. Finally, some topics (e.g., principal curves and tabu 
search) are included simply because they are interesting and provide very different 
perspectives on familiar problems. Perhaps a future researcher may draw ideas from 
such topics to design a creative and effective new algorithm. 

In this second edition, we have both updated and broadened our coverage, and 
we now provide computer code. For example, we have added new MCMC topics to reflect
continued activity in that popular area. A notable increase in breadth is our inclusion
of more methods relevant for problems where statistical dependency is important,
such as block bootstrapping and sequential importance sampling. This second edition
provides extensive new support in R. Specifically, code for the examples in this book
is available from the book website www.stat.colostate.edu/computationalstatistics.

Our target audience includes graduate students in statistics and related fields, 
statisticians, and quantitative empirical scientists in other fields. We hope such readers 
may use the book when applying standard methods and developing new methods. 




The level of mathematics expected of the reader does not extend much beyond 
Taylor series and linear algebra. Breadth of mathematical training is more helpful 
than depth. Essential review is provided in Chapter 1. More advanced readers will find
greater mathematical detail in the wide variety of high-quality books available on spe- 
cific topics, many of which are referenced in the text. Other readers caring less about 
analytical details may prefer to focus on our descriptions of algorithms and examples. 

The expected level of statistics is equivalent to that obtained by a graduate 
student in his or her first year of study of the theory of statistics and probability. 
An understanding of maximum likelihood methods, Bayesian methods, elementary 
asymptotic theory, Markov chains, and linear models is most important. Many of
these topics are reviewed in Chapter 1. 

With respect to computer programming, we find that good students can learn as 
they go. However, a working knowledge of a suitable language allows implementation
of the ideas covered in this book to progress much more quickly. We have chosen 
to forgo any language-specific examples, algorithms, or coding in the text. For those 
wishing to learn a language while they study this book, we recommend that you choose
a high-level, interactive package that permits the flexible design of graphical displays
and includes supporting statistics and probability functions, such as R and MATLAB.¹
These are the sort of languages often used by researchers during the development of 
new statistical computing techniques, and they are suitable for implementing all the 
methods we describe, except in some cases for problems of vast scope or complexity. 
We use R and recommend it. Although lower-level languages such as C++ could 
also be used, they are more appropriate for professional-grade implementation of 
algorithms after researchers have refined the methodology. 

The book is organized into four major parts: optimization (Chapters 2, 3, and 
4), integration and simulation (Chapters 5, 6, 7, and 8), bootstrapping (Chapter 9) 
and density estimation and smoothing (Chapters 10, 11, and 12). The chapters are 
written to stand independently, so a course can be built by selecting the topics one 
wishes to teach. For a one-semester course, our selection typically weights most 
heavily topics from Chapters 2, 3, 6, 7, 9, 10, and 11. With a leisurely pace or more 
thorough coverage, a shorter list of topics could still easily fill a semester course. 
There is sufficient material here to provide a thorough one-year course of study, 
notwithstanding any supplemental topics one might wish to teach. 

A variety of homework problems are included at the end of each chapter. Some 
are straightforward, while others require the student to develop a thorough understanding
of the model/method being used, to carefully (and perhaps cleverly) code a
suitable technique, and to devote considerable attention to the interpretation of results. 
A few exercises invite open-ended exploration of methods and ideas. We are some- 
times asked for solutions to the exercises, but we prefer to sequester them to preserve 
the challenge for future students and readers. 

The datasets discussed in the examples and exercises are available from the book 
website, www.stat.colostate.edu/computationalstatistics. The R code is also provided
there. Finally, the website includes a list of errata. Responsibility for all errors lies with us.


¹ R is available for free from www.r-project.org. Information about MATLAB can be found at
www.mathworks.com.
CHAPTER 1


REVIEW 


This chapter reviews notation and background material in mathematics, probability, 
and statistics. Readers may wish to skip this chapter and turn directly to Chapter 2, 
returning here only as needed. 


1.1 MATHEMATICAL NOTATION 


We use boldface to distinguish a vector $\mathbf{x} = (x_1, \ldots, x_p)$ or a matrix $\mathbf{M}$ from a scalar
variable $x$ or a constant $M$. A vector-valued function $\mathbf{f}$ evaluated at $\mathbf{x}$ is also boldfaced,
as in $\mathbf{f}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_p(\mathbf{x}))$. The transpose of $\mathbf{M}$ is denoted $\mathbf{M}^T$.

Unless otherwise specified, all vectors are considered to be column vectors, so,
for example, an $n \times p$ matrix can be written as $\mathbf{M} = (\mathbf{x}_1 \ldots \mathbf{x}_n)^T$. Let $\mathbf{I}$ denote an
identity matrix, and $\mathbf{1}$ and $\mathbf{0}$ denote vectors of ones and zeros, respectively.

A symmetric square matrix $\mathbf{M}$ is positive definite if $\mathbf{x}^T \mathbf{M} \mathbf{x} > 0$ for all nonzero
vectors $\mathbf{x}$. Positive definiteness is equivalent to the condition that all eigenvalues of
$\mathbf{M}$ are positive. $\mathbf{M}$ is nonnegative definite or positive semidefinite if $\mathbf{x}^T \mathbf{M} \mathbf{x} \geq 0$ for all
nonzero vectors $\mathbf{x}$.
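
The definition above can be checked numerically. The following is a minimal Python sketch, not part of the original text, that tests positive definiteness of a symmetric matrix by attempting a Cholesky factorization; for a symmetric matrix, the factorization succeeds with strictly positive pivots exactly when the matrix is positive definite. The function name and the two example matrices are illustrative assumptions.

```python
import math

def is_positive_definite(m):
    """Return True if the symmetric matrix m (a list of lists) is positive definite.

    Attempts a Cholesky factorization m = L L^T and reports failure as soon
    as a pivot is not strictly positive.
    """
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # nonpositive pivot: not positive definite
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True

pd_example = [[2.0, -1.0], [-1.0, 2.0]]   # eigenvalues 1 and 3
psd_example = [[1.0, 1.0], [1.0, 1.0]]    # eigenvalues 0 and 2: semidefinite only
```

In floating-point arithmetic, a matrix that is nearly singular may be classified either way, so a tolerance on the pivot may be preferable in practice.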

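The derivative of a function $f$, evaluated at $x$, is denoted $f'(x)$. When $\mathbf{x} =
(x_1, \ldots, x_p)$, the gradient of $f$ at $\mathbf{x}$ is

\[ \mathbf{f}'(\mathbf{x}) = \left( \frac{df(\mathbf{x})}{dx_1}, \ldots, \frac{df(\mathbf{x})}{dx_p} \right). \]

The Hessian matrix for $f$ at $\mathbf{x}$ is $\mathbf{f}''(\mathbf{x})$, having $(i, j)$th element equal to
$d^2 f(\mathbf{x})/(dx_i\, dx_j)$. The negative Hessian has important uses in statistical inference.

Let $\mathbf{J}(\mathbf{x})$ denote the Jacobian matrix evaluated at $\mathbf{x}$ for the one-to-one mapping
$\mathbf{y} = \mathbf{f}(\mathbf{x})$. The $(i, j)$th element of $\mathbf{J}(\mathbf{x})$ is equal to $df_i(\mathbf{x})/dx_j$.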

A functional is a real-valued function on a space of functions. For example, if
$T(f) = \int f(x)\,dx$, then the functional $T$ maps suitably integrable functions onto the
real line.

The indicator function $1_{\{A\}}$ equals 1 if $A$ is true and 0 otherwise. The real line
is denoted $\Re$, and $p$-dimensional real space is $\Re^p$.


Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. 
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc. 




1.2 TAYLOR’S THEOREM AND MATHEMATICAL 
LIMIT THEORY 


First, we define standard “big oh” and “little oh” notation for describing the relative
orders of convergence of functions. Let the functions $f$ and $g$ be defined on a common,
possibly infinite interval. Let $z_0$ be a point in this interval or a boundary point of it
(i.e., $-\infty$ or $\infty$). We require $g(z) \neq 0$ for all $z \neq z_0$ in a neighborhood of $z_0$. Then
we say

\[ f(z) = \mathcal{O}(g(z)) \tag{1.1} \]

if there exists a constant $M$ such that $|f(z)| \leq M|g(z)|$ as $z \to z_0$. For example,
$(n + 1)/(3n^2) = \mathcal{O}(n^{-1})$, and it is understood that we are considering $n \to \infty$. If
$\lim_{z \to z_0} f(z)/g(z) = 0$, then we say

\[ f(z) = o(g(z)). \tag{1.2} \]

For example, $f(x_0 + h) - f(x_0) = hf'(x_0) + o(h)$ as $h \to 0$ if $f$ is differentiable at
$x_0$. The same notation can be used for describing the convergence of a sequence $\{x_n\}$
as $n \to \infty$, by letting $f(n) = x_n$.

Taylor’s theorem provides a polynomial approximation to a function $f$. Suppose
$f$ has finite $(n + 1)$th derivative on $(a, b)$ and continuous $n$th derivative on $[a, b]$. Then
for any $x_0 \in [a, b]$ distinct from $x$, the Taylor series expansion of $f$ about $x_0$ is

\[ f(x) = \sum_{i=0}^{n} \frac{1}{i!} f^{(i)}(x_0)(x - x_0)^i + R_n, \tag{1.3} \]

where $f^{(i)}(x_0)$ is the $i$th derivative of $f$ evaluated at $x_0$, and

\[ R_n = \frac{1}{(n+1)!} f^{(n+1)}(\xi)(x - x_0)^{n+1} \tag{1.4} \]

for some point $\xi$ in the interval between $x$ and $x_0$. As $|x - x_0| \to 0$, note that $R_n =
\mathcal{O}(|x - x_0|^{n+1})$.
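
As an illustration of (1.3) and (1.4), the following Python sketch (an addition for this review, using a hypothetical choice of $f = \exp$, $x_0 = 0$, and $x = 0.5$) evaluates Taylor polynomials of increasing order and compares the actual error with the bound implied by the remainder term. Because every derivative of exp is exp itself, the remainder is bounded by the largest value of exp on the interval between $x$ and $x_0$.

```python
import math

def taylor_exp(x, x0, n):
    """nth-order Taylor polynomial of exp about x0, evaluated at x."""
    # Every derivative of exp at x0 equals exp(x0).
    return sum(math.exp(x0) * (x - x0) ** i / math.factorial(i) for i in range(n + 1))

def remainder_bound(x, x0, n):
    """Upper bound on |R_n| from (1.4), using the maximum of exp between x and x0."""
    biggest = math.exp(max(x, x0))
    return biggest * abs(x - x0) ** (n + 1) / math.factorial(n + 1)

x0, x = 0.0, 0.5
errors = [abs(math.exp(x) - taylor_exp(x, x0, n)) for n in range(1, 6)]
bounds = [remainder_bound(x, x0, n) for n in range(1, 6)]
```

Each entry of `errors` should fall below the matching entry of `bounds`, and both shrink rapidly as the order grows.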

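The multivariate version of Taylor’s theorem is analogous. Suppose $f$ is a
real-valued function of a $p$-dimensional variable $\mathbf{x}$, possessing continuous partial
derivatives of all orders up to and including $n + 1$ with respect to all coordinates, in
an open convex set containing $\mathbf{x}$ and $\mathbf{x}_0 \neq \mathbf{x}$. Then

\[ f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_{i=1}^{n} \frac{1}{i!} D^{(i)}(f; \mathbf{x}_0, \mathbf{x} - \mathbf{x}_0) + R_n, \tag{1.5} \]

where

\[ D^{(i)}(f; \mathbf{x}, \mathbf{y}) = \sum_{j_1=1}^{p} \cdots \sum_{j_i=1}^{p} \left\{ \left( \left. \frac{d^i}{dt_{j_1} \cdots dt_{j_i}} f(\mathbf{t}) \right|_{\mathbf{t}=\mathbf{x}} \right) \prod_{k=1}^{i} y_{j_k} \right\} \tag{1.6} \]

and

\[ R_n = \frac{1}{(n+1)!} D^{(n+1)}(f; \boldsymbol{\xi}, \mathbf{x} - \mathbf{x}_0) \tag{1.7} \]

for some $\boldsymbol{\xi}$ on the line segment joining $\mathbf{x}$ and $\mathbf{x}_0$. As $|\mathbf{x} - \mathbf{x}_0| \to 0$, note that $R_n =
\mathcal{O}(|\mathbf{x} - \mathbf{x}_0|^{n+1})$.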

The Euler–Maclaurin formula is useful in many asymptotic analyses. If $f$ has
$2n$ continuous derivatives in $[0, 1]$, then

\[ \int_0^1 f(x)\,dx = \frac{f(0) + f(1)}{2} - \sum_{i=1}^{n-1} \frac{b_{2i}\left(f^{(2i-1)}(1) - f^{(2i-1)}(0)\right)}{(2i)!} - \frac{b_{2n} f^{(2n)}(\xi)}{(2n)!}, \tag{1.8} \]

where $0 \leq \xi \leq 1$, $f^{(j)}$ is the $j$th derivative of $f$, and $b_j = B_j(0)$ can be determined
using the recursion relation

\[ \sum_{j=0}^{m} \binom{m+1}{j} B_j(z) = (m + 1) z^m \tag{1.9} \]

initialized with $B_0(z) = 1$. The proof of this result is based on repeated integrations
by parts [376].
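
The constants $b_j = B_j(0)$ can be generated directly from the recursion (1.9). The Python sketch below is an illustrative addition; it evaluates the recursion at $z = 0$, where the right-hand side vanishes for $m \geq 1$, and solves for each new $b_m$ in exact rational arithmetic. The function name is an assumption.

```python
from fractions import Fraction
from math import comb

def bernoulli_numbers(m_max):
    """Return [b_0, ..., b_{m_max}] with b_j = B_j(0), using (1.9) at z = 0."""
    b = [Fraction(1)]  # initialization: B_0(z) = 1
    for m in range(1, m_max + 1):
        # At z = 0 the right side (m + 1) z^m is zero for m >= 1, so
        # sum_{j=0}^{m} C(m+1, j) b_j = 0. The coefficient of b_m is m + 1.
        partial = sum(comb(m + 1, j) * b[j] for j in range(m))
        b.append(-partial / (m + 1))
    return b

b = bernoulli_numbers(6)
```

With $n = 1$, formula (1.8) reduces to the trapezoid value $(f(0) + f(1))/2$ minus $b_2 f''(\xi)/2! = f''(\xi)/12$, the familiar error term for the trapezoidal rule on a unit interval.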

Finally, we note that it is sometimes desirable to approximate the derivative of
a function numerically, using finite differences. For example, the $i$th component of
the gradient of $f$ at $\mathbf{x}$ can be approximated by

\[ \frac{df(\mathbf{x})}{dx_i} \approx \frac{f(\mathbf{x} + \epsilon_i \mathbf{e}_i) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i)}{2\epsilon_i}, \tag{1.10} \]

where $\epsilon_i$ is a small number and $\mathbf{e}_i$ is the unit vector in the $i$th coordinate direction.
Typically, one might start with, say, $\epsilon_i = 0.01$ or $0.001$ and approximate the desired
derivative for a sequence of progressively smaller $\epsilon_i$. The approximation will
generally improve until $\epsilon_i$ becomes small enough that the calculation is degraded and
eventually dominated by computer roundoff error introduced by subtractive cancellation.
Introductory discussion of this approach and a more sophisticated Richardson
extrapolation strategy for obtaining greater precision are provided in [376]. Finite
differences can also be used to approximate the second derivative of $f$ at $\mathbf{x}$ via

\[ \frac{d^2 f(\mathbf{x})}{dx_i\, dx_j} \approx \frac{1}{4\epsilon_i \epsilon_j} \Big( f(\mathbf{x} + \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) - f(\mathbf{x} + \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) - f(\mathbf{x} - \epsilon_i \mathbf{e}_i + \epsilon_j \mathbf{e}_j) + f(\mathbf{x} - \epsilon_i \mathbf{e}_i - \epsilon_j \mathbf{e}_j) \Big) \tag{1.11} \]


with similar sequential precision improvements. 
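
The behavior described above can be seen in a small experiment. The following Python sketch is an illustrative addition: it applies the central difference (1.10) to a hypothetical test function with a known gradient and records the absolute error for a sequence of decreasing step sizes. The test function, evaluation point, and step sizes are assumptions chosen for illustration.

```python
import math

def central_difference(f, x, i, eps):
    """Approximate the ith partial derivative of f at x as in (1.10)."""
    up = list(x)
    down = list(x)
    up[i] += eps
    down[i] -= eps
    return (f(up) - f(down)) / (2.0 * eps)

def f(x):
    # Hypothetical test function; its partial derivative in x[0] is cos(x[0]) * exp(x[1]).
    return math.sin(x[0]) * math.exp(x[1])

point = [1.0, 0.5]
exact = math.cos(point[0]) * math.exp(point[1])

# Absolute error of the approximation for eps = 1e-1, 1e-2, ..., 1e-12.
fd_errors = {k: abs(central_difference(f, point, 0, 10.0 ** -k) - exact)
             for k in range(1, 13)}
```

On typical double-precision hardware the error falls roughly a hundredfold for each tenfold reduction in the step size at first, then stops improving and usually grows again for the smallest steps as subtractive cancellation takes over.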
1.3 STATISTICAL NOTATION AND PROBABILITY
DISTRIBUTIONS 


We use capital letters to denote random variables, such as Y or X, and lowercase 
letters to represent specific realized values of random variables such as y or x. The 
probability density function of X is denoted f, the cumulative distribution function 
is F. We use the notation X ~ f(x) to mean that X is distributed with density f(x). 
Frequently, the dependence of f(x) on one or more parameters also will be denoted 
with a conditioning bar, as in $f(x|\alpha, \beta)$. Because of the diversity of topics covered in
this book, we want to be careful to distinguish when $f(x|\alpha)$ refers to a density function
as opposed to the evaluation of that density at a point $x$. When the meaning is unclear
from the context, we will be explicit, for example, by using $f(\cdot|\alpha)$ to denote the
function. When it is important to distinguish among several densities, we may adopt 
subscripts referring to specific random variables, so that the density functions for 
X and Y are fx and fy, respectively. We use the same notation for distributions of 
discrete random variables and in the Bayesian context. 

The conditional distribution of $X$ given that $Y$ equals $y$ (i.e., $X|Y = y$) is described
by the density denoted $f(x|y)$, or $f_{X|Y}(x|y)$. In this case, we write that $X|Y$ has
density $f(x|Y)$. For notational simplicity we allow density functions to be implicitly
specified by their arguments, so we may use the same symbol, say $f$, to refer to many
distinct functions, as in the equation $f(x, y|\mu) = f(x|y, \mu) f(y|\mu)$. Finally, $f(X)$ and
F(X) are random variables: the evaluations of the density and cumulative distribution 
functions, respectively, at the random argument X. 

The expectation of a random variable is denoted E{X}. Unless specifically 
mentioned, the distribution with respect to which an expectation is taken is the dis- 
tribution of X or should be implicit from the context. To denote the probability of 
an event $A$, we use $P[A] = E\{1_{\{A\}}\}$. The conditional expectation of $X|Y = y$ is
$E\{X|y\}$. When $Y$ is unknown, $E\{X|Y\}$ is a random variable that depends on $Y$. Other
attributes of the distribution of $X$ and $Y$ include var$\{X\}$, cov$\{X, Y\}$, cor$\{X, Y\}$, and
cv$\{X\}$ = var$\{X\}^{1/2}/E\{X\}$. These quantities are the variance of $X$, the covariance and
correlation of X and Y, and the coefficient of variation of X, respectively. 

A useful result regarding expectations is Jensen’s inequality. Let $g$ be a convex
function on a possibly infinite open interval $I$, so

\[ g(\lambda x + (1 - \lambda)y) \leq \lambda g(x) + (1 - \lambda) g(y) \tag{1.12} \]

for all $x, y \in I$ and all $0 < \lambda < 1$. Then Jensen’s inequality states that $E\{g(X)\} \geq
g(E\{X\})$ for any random variable $X$ having $P[X \in I] = 1$.
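
Jensen’s inequality is easy to verify on a small example. The Python sketch below is an illustrative addition that uses a hypothetical four-point discrete distribution and the convex function $g = \exp$, computing both sides of the inequality exactly as finite sums.

```python
import math

# Hypothetical discrete distribution: support points and probabilities summing to 1.
values = [-1.0, 0.0, 2.0, 5.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expectation(g, values, probs):
    """E{g(X)} for a discrete random variable X."""
    return sum(p * g(v) for v, p in zip(values, probs))

mean_x = expectation(lambda v: v, values, probs)   # E{X}
lhs = expectation(math.exp, values, probs)         # E{g(X)} with convex g = exp
rhs = math.exp(mean_x)                             # g(E{X})
```

Here `lhs` exceeds `rhs` by a wide margin because exp is strongly convex over the range of the support; for a linear $g$ the two sides would coincide.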

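Tables 1.1, 1.2, and 1.3 provide information about many discrete and continuous
distributions used throughout this book. We refer to the following well-known
combinatorial constants:

\[ n! = n(n - 1)(n - 2) \cdots (3)(2)(1) \quad \text{with } 0! = 1, \tag{1.13} \]

\[ \binom{n}{k_1 \, \ldots \, k_m} = \frac{n!}{\prod_{i=1}^{m} k_i!} \quad \text{where } n = \sum_{i=1}^{m} k_i, \tag{1.15} \]

\[ \Gamma(r) = \begin{cases} (r - 1)! & \text{for } r = 1, 2, \ldots \\ \int_0^\infty t^{r-1} \exp\{-t\}\,dt & \text{for general } r > 0. \end{cases} \tag{1.16} \]

It is worth knowing that

\[ \Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi} \quad \text{and} \quad \Gamma\left(n + \tfrac{1}{2}\right) = \frac{1 \times 3 \times 5 \times \cdots \times (2n - 1)\sqrt{\pi}}{2^n} \]

for positive integer $n$.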
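Many of the distributions commonly used in statistics are members of an
exponential family. A $k$-parameter exponential family density can be expressed as

\[ f(x|\boldsymbol{\gamma}) = c_1(x)\, c_2(\boldsymbol{\gamma}) \exp\left\{ \sum_{i=1}^{k} y_i(x)\, \theta_i(\boldsymbol{\gamma}) \right\} \tag{1.17} \]

for nonnegative functions $c_1$ and $c_2$. The vector $\boldsymbol{\gamma}$ denotes the familiar parameters, such
as $\lambda$ for the Poisson density and $p$ for the binomial density. The real-valued $\theta_i(\boldsymbol{\gamma})$ are
the natural, or canonical, parameters, which are usually transformations of $\boldsymbol{\gamma}$. The
$y_i(x)$ are the sufficient statistics for the canonical parameters. It is straightforward
to show

\[ E\{\mathbf{y}(X)\} = \boldsymbol{\kappa}'(\boldsymbol{\theta}) \tag{1.18} \]

and

\[ \text{var}\{\mathbf{y}(X)\} = \boldsymbol{\kappa}''(\boldsymbol{\theta}), \tag{1.19} \]

where $\kappa(\boldsymbol{\theta}) = -\log c_3(\boldsymbol{\theta})$, letting $c_3(\boldsymbol{\theta})$ denote the reexpression of $c_2(\boldsymbol{\gamma})$ in terms of
the canonical parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)$, and $\mathbf{y}(X) = (y_1(X), \ldots, y_k(X))$. These
results can be rewritten in terms of the original parameters $\boldsymbol{\gamma}$ as

\[ E\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\boldsymbol{\gamma})}{d\gamma_j}\, y_i(X) \right\} = -\frac{d}{d\gamma_j} \log c_2(\boldsymbol{\gamma}) \tag{1.20} \]

and

\[ \text{var}\left\{ \sum_{i=1}^{k} \frac{d\theta_i(\boldsymbol{\gamma})}{d\gamma_j}\, y_i(X) \right\} = -\frac{d^2}{d\gamma_j^2} \log c_2(\boldsymbol{\gamma}) - E\left\{ \sum_{i=1}^{k} \frac{d^2\theta_i(\boldsymbol{\gamma})}{d\gamma_j^2}\, y_i(X) \right\}. \tag{1.21} \]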


Example 1.1 (Poisson) The Poisson distribution belongs to the exponential family
with $c_1(x) = 1/x!$, $c_2(\lambda) = \exp\{-\lambda\}$, $y(x) = x$, and $\theta(\lambda) = \log \lambda$. Deriving moments
in terms of $\theta$, we have $\kappa(\theta) = \exp\{\theta\}$, so $E\{X\} = \kappa'(\theta) = \exp\{\theta\} = \lambda$ and var$\{X\} =
\kappa''(\theta) = \exp\{\theta\} = \lambda$. The same results may be obtained with (1.20) and (1.21), noting
that $d\theta/d\lambda = 1/\lambda$. For example, (1.20) gives $E\{X/\lambda\} = 1$.
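
The moments in Example 1.1 can be confirmed numerically. The Python sketch below is an illustrative addition: it builds the Poisson mass function from the exponential family pieces $c_1$, $c_2$, $y$, and $\theta$, and then sums over a truncated support. The value $\lambda = 3.5$ and the truncation point are assumptions; the neglected tail mass is negligible for this $\lambda$.

```python
import math

def poisson_pmf(x, lam):
    """f(x | lambda) = c1(x) c2(lambda) exp{y(x) theta(lambda)} as in Example 1.1."""
    c1 = 1.0 / math.factorial(x)
    c2 = math.exp(-lam)
    theta = math.log(lam)          # canonical parameter
    return c1 * c2 * math.exp(x * theta)

lam = 3.5
support = range(0, 100)            # truncated support; the omitted tail is negligible here
mean = sum(x * poisson_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in support)
```

Both `mean` and `var` should agree with $\lambda$ to within rounding error, matching $\kappa'(\theta) = \kappa''(\theta) = \exp\{\theta\}$.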


It is also important to know how the distribution of a random variable changes
when it is transformed. Let $\mathbf{X} = (X_1, \ldots, X_p)$ denote a $p$-dimensional random
variable with continuous density function $f$. Suppose that

\[ \mathbf{U} = \mathbf{g}(\mathbf{X}) = (g_1(\mathbf{X}), \ldots, g_p(\mathbf{X})) = (U_1, \ldots, U_p), \tag{1.22} \]

where $\mathbf{g}$ is a one-to-one function mapping the support region of $f$ onto the space of
all $\mathbf{u} = \mathbf{g}(\mathbf{x})$ for which $\mathbf{x}$ satisfies $f(\mathbf{x}) > 0$. To derive the probability distribution of
$\mathbf{U}$ from that of $\mathbf{X}$, we need to use the Jacobian matrix. The density of the transformed
variables is

\[ f(\mathbf{u}) = f(\mathbf{g}^{-1}(\mathbf{u}))\, |\mathbf{J}(\mathbf{u})|, \tag{1.23} \]

where $|\mathbf{J}(\mathbf{u})|$ is the absolute value of the determinant of the Jacobian matrix of $\mathbf{g}^{-1}$
evaluated at $\mathbf{u}$, having $(i, j)$th element $dx_i/du_j$, where these derivatives are assumed
to be continuous over the support region of $\mathbf{U}$.
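
A one-dimensional case makes (1.23) concrete. In the Python sketch below, an illustrative addition, $X$ is standard normal and $U = \exp(X)$, so $g^{-1}(u) = \log u$ and the Jacobian factor is $1/u$. The sketch checks the resulting density by confirming numerically that $P[U \leq e]$ equals $P[X \leq 1]$. The choice of transformation and the midpoint integration rule are assumptions made for illustration.

```python
import math

def normal_density(x):
    """Standard normal density, playing the role of the density f of X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def density_of_u(u):
    """Density of U = exp(X) from (1.23): f(g^{-1}(u)) |J(u)| with g^{-1}(u) = log(u)."""
    jacobian = 1.0 / u             # d(log u)/du, positive on the support u > 0
    return normal_density(math.log(u)) * jacobian

def midpoint_integral(h, a, b, n):
    """Midpoint-rule approximation to the integral of h over [a, b] using n cells."""
    width = (b - a) / n
    return sum(h(a + (k + 0.5) * width) for k in range(n)) * width

prob_u = midpoint_integral(density_of_u, 0.0, math.e, 20000)   # P[U <= e]
prob_x = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))          # P[X <= 1]
```

Omitting the Jacobian factor would give a function that does not integrate to the correct probability, which is a common source of error when transforming densities.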


1.4 LIKELIHOOD INFERENCE 


If $X_1, \ldots, X_n$ are independent and identically distributed (i.i.d.) each having density
$f(x|\boldsymbol{\theta})$ that depends on a vector of $p$ unknown parameters $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p)$, then the
joint likelihood function is

\[ L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(x_i|\boldsymbol{\theta}). \tag{1.24} \]

When the data are not i.i.d., the joint likelihood is still expressed as the joint density
$f(x_1, \ldots, x_n|\boldsymbol{\theta})$ viewed as a function of $\boldsymbol{\theta}$.

The observed data, $x_1, \ldots, x_n$, might have been realized under many different
values for $\boldsymbol{\theta}$. The parameters for which observing $x_1, \ldots, x_n$ would be most likely
constitute the maximum likelihood estimate of $\boldsymbol{\theta}$. In other words, if $\hat{\boldsymbol{\vartheta}}$ is the function
of $x_1, \ldots, x_n$ that maximizes $L(\boldsymbol{\theta})$, then $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\vartheta}}(X_1, \ldots, X_n)$ is the maximum
likelihood estimator (MLE) for $\boldsymbol{\theta}$. MLEs are invariant to transformation, so the MLE of a
transformation of $\boldsymbol{\theta}$ equals the transformation of $\hat{\boldsymbol{\theta}}$.

It is typically easier to work with the log likelihood function,

\[ l(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}), \tag{1.25} \]

which has the same maximum as the original likelihood, since log is a strictly monotonic
function. Furthermore, any additive constants (involving possibly $x_1, \ldots, x_n$
but not $\boldsymbol{\theta}$) may be omitted from the log likelihood without changing the location of its
maximum or differences between log likelihoods at different $\boldsymbol{\theta}$. Note that maximizing
$L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ is equivalent to solving the system of equations

\[ \mathbf{l}'(\boldsymbol{\theta}) = \mathbf{0}, \tag{1.26} \]

where

\[ \mathbf{l}'(\boldsymbol{\theta}) = \left( \frac{dl(\boldsymbol{\theta})}{d\theta_1}, \ldots, \frac{dl(\boldsymbol{\theta})}{d\theta_p} \right) \]

is called the score function. The score function satisfies

\[ E\{\mathbf{l}'(\boldsymbol{\theta})\} = \mathbf{0}, \tag{1.27} \]

where the expectation is taken with respect to the distribution of $X_1, \ldots, X_n$. Sometimes
an analytical solution to (1.26) provides the MLE; this book describes a variety
of methods that can be used when the MLE cannot be solved for in closed form. It 
is worth noting that there are pathological circumstances where the MLE is not a 
solution of the score equation, or the MLE is not unique; see [127] for examples. 

The MLE has a sampling distribution because it depends on the realization 
of the random variables $X_1, \ldots, X_n$. The MLE may be biased or unbiased for $\boldsymbol{\theta}$, yet
under quite general conditions it is asymptotically unbiased as $n \to \infty$. The sampling
variance of the MLE depends on the average curvature of the log likelihood: When the 
log likelihood is very pointy, the location of the maximum is more precisely known. 

To make this precise, let $\mathbf{l}''(\boldsymbol{\theta})$ denote the $p \times p$ matrix having $(i, j)$th element
given by $d^2 l(\boldsymbol{\theta})/(d\theta_i\, d\theta_j)$. The Fisher information matrix is defined as

\[ \mathbf{I}(\boldsymbol{\theta}) = E\{\mathbf{l}'(\boldsymbol{\theta})\, \mathbf{l}'(\boldsymbol{\theta})^T\} = -E\{\mathbf{l}''(\boldsymbol{\theta})\}, \tag{1.28} \]

where the expectations are taken with respect to the distribution of $X_1, \ldots, X_n$. The
final equality in (1.28) requires mild assumptions, which are satisfied, for example, in
exponential families. $\mathbf{I}(\boldsymbol{\theta})$ may sometimes be called the expected Fisher information
to distinguish it from $-\mathbf{l}''(\boldsymbol{\theta})$, which is the observed Fisher information. There are
two reasons why the observed Fisher information is quite useful. First, it can be
calculated even if the expectations in (1.28) cannot easily be computed. Second, it is
a good approximation to $\mathbf{I}(\boldsymbol{\theta})$ that improves as $n$ increases.

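Under regularity conditions, the asymptotic variance-covariance matrix of
the MLE $\hat{\boldsymbol{\theta}}$ is $\mathbf{I}(\boldsymbol{\theta}^*)^{-1}$, where $\boldsymbol{\theta}^*$ denotes the true value of $\boldsymbol{\theta}$. Indeed, as $n \to \infty$, the
limiting distribution of $\hat{\boldsymbol{\theta}}$ is $N_p(\boldsymbol{\theta}^*, \mathbf{I}(\boldsymbol{\theta}^*)^{-1})$. Since the true parameter values are
unknown, $\mathbf{I}(\boldsymbol{\theta}^*)^{-1}$ must be estimated in order to estimate the variance-covariance
matrix of the MLE. An obvious approach is to use $\mathbf{I}(\hat{\boldsymbol{\theta}})^{-1}$. Alternatively, it is also
reasonable to use $-\mathbf{l}''(\hat{\boldsymbol{\theta}})^{-1}$. Standard errors for individual parameter MLEs can
be estimated by taking the square root of the diagonal elements of the chosen
estimate of $\mathbf{I}(\boldsymbol{\theta}^*)^{-1}$. A thorough introduction to maximum likelihood theory and
the relative merits of these estimates of $\mathbf{I}(\boldsymbol{\theta}^*)^{-1}$ can be found in [127, 182, 371, 470].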
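
For a one-parameter illustration of the score, the observed Fisher information, and the resulting standard error, consider i.i.d. Poisson counts. The Python sketch below is an addition for this review and uses hypothetical data. For this model the log likelihood is $l(\lambda) = \sum_i (x_i \log \lambda - \lambda)$ up to an additive constant, so $l'(\lambda) = \sum_i x_i/\lambda - n$ and $-l''(\lambda) = \sum_i x_i/\lambda^2$.

```python
import math

data = [2, 4, 3, 0, 5, 3, 1, 4]    # hypothetical i.i.d. Poisson counts

def log_lik(lam):
    """Poisson log likelihood with the additive constant -sum(log x_i!) omitted."""
    return sum(x * math.log(lam) - lam for x in data)

def score(lam):
    """First derivative l'(lambda)."""
    return sum(data) / lam - len(data)

def observed_information(lam):
    """Negative second derivative -l''(lambda)."""
    return sum(data) / lam ** 2

mle = sum(data) / len(data)        # closed-form solution of score(lam) = 0
std_error = math.sqrt(1.0 / observed_information(mle))
```

In this model the expected information is $n/\lambda$, so the expected and observed information coincide when evaluated at the MLE, and the standard error reduces to $\sqrt{\bar{x}/n}$. In models without a closed-form MLE, the same quantities would be obtained from a numerical optimizer.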

Profile likelihoods provide an informative way to graph a higher-dimensional 
likelihood surface, to make inference about some parameters while treating others 
as nuisance parameters, and to facilitate various optimization strategies. The profile 
likelihood is obtained by constrained maximization of the full likelihood with respect 
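to parameters to be ignored. If $\boldsymbol{\theta} = (\boldsymbol{\mu}, \boldsymbol{\phi})$, then the profile likelihood for $\boldsymbol{\phi}$ is

\[ L(\boldsymbol{\phi}|\hat{\boldsymbol{\mu}}(\boldsymbol{\phi})) = \max_{\boldsymbol{\mu}} L(\boldsymbol{\mu}, \boldsymbol{\phi}). \tag{1.29} \]

Thus, for each possible $\boldsymbol{\phi}$, a value of $\boldsymbol{\mu}$ is chosen to maximize $L(\boldsymbol{\mu}, \boldsymbol{\phi})$. This optimal
$\boldsymbol{\mu}$ is a function of $\boldsymbol{\phi}$. The profile likelihood is the function that maps $\boldsymbol{\phi}$ to the value
of the full likelihood evaluated at $\boldsymbol{\phi}$ and its corresponding optimal $\boldsymbol{\mu}$. Note that the $\hat{\boldsymbol{\phi}}$
that maximizes the profile likelihood $L(\boldsymbol{\phi}|\hat{\boldsymbol{\mu}}(\boldsymbol{\phi}))$ is also the MLE for $\boldsymbol{\phi}$ obtained from
the full likelihood $L(\boldsymbol{\mu}, \boldsymbol{\phi})$. Profile likelihood methods are examined in [23].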
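
A small grid-based example shows the mechanics of (1.29). In the Python sketch below, an illustrative addition with hypothetical data, the observations are treated as normal with mean $\mu$ (the nuisance parameter) and variance $\phi$ (the parameter of interest). For each candidate $\phi$ the log likelihood is maximized over a grid of $\mu$ values, and the maximizer of the resulting profile is compared with the closed-form full MLE of the variance. The grids are assumptions; a real analysis would normally use a numerical optimizer.

```python
import math

data = [4.1, 5.3, 3.8, 6.0, 4.7, 5.5]   # hypothetical sample

def log_lik(mu, phi):
    """Normal log likelihood with mean mu and variance phi (additive constants omitted)."""
    n = len(data)
    return -0.5 * n * math.log(phi) - sum((x - mu) ** 2 for x in data) / (2.0 * phi)

mu_grid = [3.0 + 0.02 * k for k in range(201)]      # candidate nuisance values
phi_grid = [0.05 + 0.005 * k for k in range(400)]   # candidate values of interest

def profile_log_lik(phi):
    """Maximize over the nuisance parameter mu for fixed phi, as in (1.29)."""
    return max(log_lik(mu, phi) for mu in mu_grid)

phi_hat_profile = max(phi_grid, key=profile_log_lik)

# Closed-form full MLE of the variance, for comparison.
xbar = sum(data) / len(data)
phi_hat_full = sum((x - xbar) ** 2 for x in data) / len(data)
```

The two estimates agree up to the resolution of the grid, consistent with the statement that the maximizer of the profile likelihood is the MLE from the full likelihood.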


1.5 BAYESIAN INFERENCE 


In the Bayesian inferential paradigm, probability distributions are associated with 
the parameters of the likelihood, as if the parameters were random variables. The 
probability distributions are used to assign subjective relative probabilities to regions 
of parameter space to reflect knowledge (and uncertainty) about the parameters. 

Suppose that $\mathbf{X}$ has a distribution parameterized by $\boldsymbol{\theta}$. Let $f(\boldsymbol{\theta})$ represent the
density assigned to $\boldsymbol{\theta}$ before observing the data. This is called a prior distribution.
It may be based on previous data and analyses (e.g., pilot studies), it may represent 
a purely subjective personal belief, or it may be chosen in a way intended to have 
limited influence on final inference. 

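Bayesian inference is driven by the likelihood, often denoted $L(\boldsymbol{\theta}|\mathbf{x})$ in this
context. Having established a prior distribution for $\boldsymbol{\theta}$ and subsequently observed data
yielding a likelihood that is informative about $\boldsymbol{\theta}$, one’s prior beliefs must be updated
to reflect the information contained in the likelihood. The updating mechanism is
Bayes’ theorem:

\[ f(\boldsymbol{\theta}|\mathbf{x}) = c f(\boldsymbol{\theta}) f(\mathbf{x}|\boldsymbol{\theta}) = c f(\boldsymbol{\theta}) L(\boldsymbol{\theta}|\mathbf{x}), \tag{1.30} \]

where $f(\boldsymbol{\theta}|\mathbf{x})$ is the posterior density of $\boldsymbol{\theta}$. The posterior distribution for $\boldsymbol{\theta}$ is used
for statistical inference about $\boldsymbol{\theta}$. The constant $c$ equals $1/\int f(\boldsymbol{\theta}) L(\boldsymbol{\theta}|\mathbf{x})\, d\boldsymbol{\theta}$ and is often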
difficult to compute directly, although some inferences do not require c. This book 
describes a large variety of methods for enabling Bayesian inference, including the 
estimation of c. 
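
In one dimension the constant $c$ in (1.30) can be approximated by simple numerical integration. The Python sketch below is an illustrative addition that assumes hypothetical binomial data and a uniform prior on $(0, 1)$; it evaluates prior times likelihood on a midpoint grid, estimates $c$, and then computes the posterior mean. For this particular setup the exact answers are known ($c = n + 1$ and posterior mean $(x + 1)/(n + 2)$), which gives a check on the approximation.

```python
import math

n_trials, successes = 20, 14       # hypothetical data

def prior(theta):
    """Uniform prior density on (0, 1)."""
    return 1.0

def likelihood(theta):
    """Binomial likelihood L(theta | x)."""
    return (math.comb(n_trials, successes)
            * theta ** successes * (1.0 - theta) ** (n_trials - successes))

# Midpoint grid on (0, 1).
m = 10000
width = 1.0 / m
grid = [(k + 0.5) * width for k in range(m)]

unnormalized = [prior(t) * likelihood(t) for t in grid]
c = 1.0 / (sum(unnormalized) * width)          # c = 1 / integral of f(theta) L(theta | x)
posterior = [c * u for u in unnormalized]      # f(theta | x) evaluated on the grid
posterior_mean = sum(t * p for t, p in zip(grid, posterior)) * width
```

Grid integration of this kind scales poorly with the dimension of the parameter, which is one motivation for the simulation-based methods treated later in the book.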

Let $\tilde{\boldsymbol{\theta}}$ be the posterior mode, and let $\boldsymbol{\theta}^*$ be the true value of $\boldsymbol{\theta}$. The posterior
distribution of $\tilde{\boldsymbol{\theta}}$ converges to $N(\boldsymbol{\theta}^*, \mathbf{I}(\boldsymbol{\theta}^*)^{-1})$ as $n \to \infty$, under regularity conditions.
Note that this is the same limiting distribution as for the MLE. Thus, the posterior
mode is of particular interest as a consistent estimator of $\boldsymbol{\theta}$. This convergence reflects
the fundamental notion that the observed data should overwhelm any prior as $n \to \infty$.

Bayesian evaluation of hypotheses relies upon the Bayes factor. The ratio of
posterior probabilities of two competing hypotheses or models, $H_1$ and $H_2$, is

\[ \frac{P[H_2|\mathbf{x}]}{P[H_1|\mathbf{x}]} = \frac{P[H_2]}{P[H_1]}\, B_{2,1}, \tag{1.31} \]

where $P[H_i|\mathbf{x}]$ denotes posterior probability, $P[H_i]$ denotes prior probability, and

\[ B_{2,1} = \frac{f(\mathbf{x}|H_2)}{f(\mathbf{x}|H_1)} = \frac{\int f(\boldsymbol{\theta}_2|H_2)\, f(\mathbf{x}|\boldsymbol{\theta}_2, H_2)\, d\boldsymbol{\theta}_2}{\int f(\boldsymbol{\theta}_1|H_1)\, f(\mathbf{x}|\boldsymbol{\theta}_1, H_1)\, d\boldsymbol{\theta}_1}, \tag{1.32} \]

with $\boldsymbol{\theta}_i$ denoting the parameters corresponding to the $i$th hypothesis. The quantity
$B_{2,1}$ is the Bayes factor; it represents the factor by which the prior odds are multiplied
to produce the posterior odds, given the data. The hypotheses $H_1$ and $H_2$ need not be
nested as for likelihood ratio methods. The computation and interpretation of Bayes
factors is reviewed in [365].
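
A simple case shows how (1.31) and (1.32) are used. In the Python sketch below, an illustrative addition with hypothetical binomial data, $H_1$ fixes the success probability at 0.5, so its marginal likelihood requires no integration, while $H_2$ places a uniform prior on the success probability, so its marginal likelihood is the integral of prior times likelihood. The hypotheses, the data, and the equal prior odds are all assumptions.

```python
import math

n_trials, successes = 20, 14       # hypothetical data

def binom_lik(theta):
    """Binomial likelihood f(x | theta)."""
    return (math.comb(n_trials, successes)
            * theta ** successes * (1.0 - theta) ** (n_trials - successes))

# H1 fixes theta = 0.5, so its marginal likelihood is a single evaluation.
marginal_h1 = binom_lik(0.5)

# H2 places a Uniform(0, 1) prior on theta; integrate prior * likelihood (midpoint rule).
m = 10000
marginal_h2 = sum(1.0 * binom_lik((k + 0.5) / m) for k in range(m)) / m

bayes_factor = marginal_h2 / marginal_h1       # B_{2,1} from (1.32)
prior_odds = 1.0                               # P[H2] / P[H1], assumed equal here
posterior_odds = prior_odds * bayes_factor     # (1.31)
```

Under the uniform prior the marginal likelihood is exactly $1/(n + 1)$, so the numerical integral can be checked against a closed form in this special case.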

Bayesian interval estimation often relies on a 95% highest posterior density 
(HPD) region. The HPD region for a parameter is the region of shortest total length 
containing 95% of the posterior probability for that parameter for which the posterior 
density for every point contained in the interval is never lower than the density for 
every point outside the interval. For unimodal posteriors, the HPD is the narrowest 
possible interval containing 95% of the posterior probability. A more general interval 
for Bayesian inference is a credible interval. The $100(1 - \alpha)$% credible interval is
the region between the $\alpha/2$ and $1 - \alpha/2$ quantiles of the posterior distribution. When
the posterior density is symmetric and unimodal, the HPD and the credible interval 
are identical. 

A primary benefit of the Bayesian approach to inference is the natural manner 
in which resulting credibility intervals and other inferences are interpreted. One may 
speak of the posterior probability that the parameter is in some range. There is also 
a sound philosophical basis for the Bayesian paradigm; see [28] for an introduction. 
Gelman et al. provide a broad survey of Bayesian theory and methods [221]. 

The best prior distributions are those based on prior data. A strategy that is 
algebraically convenient is to seek conjugacy. A conjugate prior distribution is one that 
yields a posterior distribution in the same parametric family as the prior distribution. 
Exponential families are the only classes of distributions that have natural conjugate 
prior distributions. 

When prior information is poor, it is important to ensure that the chosen prior distribution
does not strongly influence posterior inferences. A posterior that is strongly
influenced by the prior is said to be highly sensitive to the prior. Several strategies are 
available to reduce sensitivity. The simplest approach is to use a prior whose support 
is dispersed over a much broader region than the parameter region supported by the 
data, and fairly flat over it. A more formal approach is to use a Jeffreys prior [350].
In the univariate case, the Jeffreys prior is $f(\theta) \propto I(\theta)^{1/2}$, where $I(\theta)$ is the Fisher
information; multivariate extensions are possible. In some cases, the improper prior
$f(\theta) \propto 1$ may be considered, but this can lead to improper posteriors (i.e., not integrable),
and it can be unintentionally informative depending on the parameterization
of the problem. 


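Example 1.2 (Normal–Normal Conjugate Bayes Model) Consider Bayesian
inference based on observations of i.i.d. random variables $X_1, \ldots, X_n$ with density
$X_i|\theta \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known. For such a likelihood, a normal prior for $\theta$ is
conjugate. Suppose the prior is $\theta \sim N(\mu, \tau^2)$. The posterior density is

\[ f(\theta|\mathbf{x}) \propto f(\theta) \prod_{i=1}^{n} f(x_i|\theta) \tag{1.33} \]

\[ \propto \exp\left\{ -\frac{1}{2} \left( \frac{(\theta - \mu)^2}{\tau^2} + \frac{\sum_{i=1}^{n} (x_i - \theta)^2}{\sigma^2} \right) \right\} \tag{1.34} \]

\[ \propto \exp\left\{ -\frac{1}{2} \left( \theta - \frac{\mu/\tau^2 + (n\bar{x})/\sigma^2}{1/\tau^2 + n/\sigma^2} \right)^2 \Big/ \frac{1}{1/\tau^2 + n/\sigma^2} \right\}, \tag{1.35} \]

where $\bar{x}$ is the sample mean. Recognizing (1.35) as being in the form of a normal
distribution, we conclude that $f(\theta|\mathbf{x}) = N(\mu_n, \tau_n^2)$, where

\[ \tau_n^2 = \frac{1}{1/\tau^2 + n/\sigma^2} \tag{1.36} \]

and

\[ \mu_n = \left( \frac{\mu}{\tau^2} + \frac{n\bar{x}}{\sigma^2} \right) \tau_n^2. \tag{1.37} \]

Hence, a posterior 95% credibility interval for $\theta$ is $(\mu_n - 1.96\tau_n, \mu_n + 1.96\tau_n)$. Since
the normal distribution is symmetric, this is also the posterior 95% HPD for $\theta$.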

For fixed $\sigma$, consider increasingly large choices for the value of $\tau$. The posterior
variance for $\theta$ converges to $\sigma^2/n$ as $\tau^2 \to \infty$. In other words, the influence of the
prior on the posterior vanishes as the prior variance increases. Next, note that

\[ \lim_{n \to \infty} \frac{\tau_n^2}{\sigma^2/n} = 1. \]

This shows that the posterior variance for $\theta$ and the sampling variance for the MLE,
$\hat{\theta} = \bar{X}$, are asymptotically equal, and the effect of any choice for $\tau$ is washed out with
increasing sample size.

As an alternative to the conjugate prior, consider using the improper prior
$f(\theta) \propto 1$. In this case, $f(\theta|\mathbf{x}) = N(\bar{x}, \sigma^2/n)$, and a 95% posterior credibility
interval corresponds to the standard 95% confidence interval found using frequentist
methods.
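
The conjugate updating in Example 1.2 amounts to a few lines of arithmetic. The Python sketch below is an illustrative addition with hypothetical data and prior settings; it evaluates (1.36) and (1.37), forms the 95% interval, and then repeats the calculation with a very large prior variance to show the posterior approaching $N(\bar{x}, \sigma^2/n)$.

```python
import math

def normal_posterior(data, sigma2, mu, tau2):
    """Posterior mean and variance from (1.37) and (1.36) for known sampling variance."""
    n = len(data)
    xbar = sum(data) / n
    tau_n2 = 1.0 / (1.0 / tau2 + n / sigma2)
    mu_n = (mu / tau2 + n * xbar / sigma2) * tau_n2
    return mu_n, tau_n2

data = [9.8, 10.4, 10.1, 9.5, 10.7]   # hypothetical observations
sigma2 = 0.25                         # known sampling variance (assumed)

mu_n, tau_n2 = normal_posterior(data, sigma2, mu=8.0, tau2=4.0)
interval = (mu_n - 1.96 * math.sqrt(tau_n2), mu_n + 1.96 * math.sqrt(tau_n2))

# A very diffuse prior nearly reproduces the flat-prior posterior N(xbar, sigma2 / n).
mu_flat, tau_flat2 = normal_posterior(data, sigma2, mu=8.0, tau2=1e12)
```

The posterior mean is a precision-weighted average of the prior mean and the sample mean, so it lies between the two and moves toward the sample mean as either $n$ or $\tau^2$ grows.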


1.6 STATISTICAL LIMIT THEORY 


Although this book is mostly concerned with a pragmatic examination of how and 
why various methods work, it is useful from time to time to speak more precisely 
about the limiting behavior of the estimators produced by some procedures. We review 
below some basic convergence concepts used in probability and statistics. 

A sequence of random variables, $X_1, X_2, \ldots$, is said to converge in probability
to the random variable $X$ if $\lim_{n \to \infty} P[|X_n - X| < \epsilon] = 1$ for every $\epsilon > 0$.
The sequence converges almost surely to $X$ if $P[\lim_{n \to \infty} |X_n - X| < \epsilon] = 1$ for
every $\epsilon > 0$. The variables converge in distribution to the distribution of $X$ if
$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$ for all points $x$ at which $F_X(x)$ is continuous. The variable
$X$ has property $A$ almost everywhere if $P[A] = \int 1_{\{A\}} f_X(x)\,dx = 1$.

Some of the best-known convergence theorems in statistics are the laws of large 
numbers and the central limit theorem. For i.i.d. sequences of one-dimensional random
variables $X_1, X_2, \ldots$, let $\bar{X}_n = \sum_{i=1}^{n} X_i/n$. The weak law of large numbers states that
$\bar{X}_n$ converges in probability to $\mu = E\{X_i\}$ if $E\{|X_i|\} < \infty$. The strong law of large
numbers states that $\bar{X}_n$ converges almost surely to $\mu$ if $E\{|X_i|\} < \infty$. Both results
hold under the more stringent but easily checked condition that var$\{X_i\} = \sigma^2 < \infty$.
If $\theta$ is a parameter and $T_n$ is a statistic based on $X_1, \ldots, X_n$, then $T_n$ is said to
be weakly or strongly consistent for $\theta$ if $T_n$ converges in probability or almost surely
(respectively) to $\theta$. $T_n$ is unbiased for $\theta$ if $E\{T_n\} = \theta$; otherwise the bias is $E\{T_n\} - \theta$.
If the bias vanishes as $n \to \infty$, then $T_n$ is asymptotically unbiased.

A simple form of the central limit theorem is as follows. Suppose that i.i.d. random
variables $X_1, \ldots, X_n$ have mean $\mu$ and finite variance $\sigma^2$, and that $E\{\exp\{tX_i\}\}$
exists in a neighborhood of $t = 0$. Then the random variable $T_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$
converges in distribution to a normal random variable with mean zero and variance
one, as $n \to \infty$. There are many versions of the central limit theorem for various
situations. Generally speaking, the assumption of finite variance is critical, but the 
assumptions of independence and identical distributions can be relaxed in certain 
situations. 
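
The central limit theorem can be explored by simulation. The Python sketch below is an illustrative addition: it draws repeated samples of i.i.d. Exponential(1) variables, for which $\mu = \sigma = 1$ and the moment generating function exists near zero, and examines the standardized means $T_n$. The seed, sample size, and number of replicates are arbitrary assumptions.

```python
import math
import random

random.seed(1)   # fixed seed so the experiment is repeatable

def standardized_mean(n):
    """T_n = sqrt(n) (Xbar_n - mu) / sigma for n i.i.d. Exponential(1) draws (mu = sigma = 1)."""
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (xbar - 1.0) / 1.0

n, replicates = 200, 5000
draws = [standardized_mean(n) for _ in range(replicates)]

sample_mean = sum(draws) / replicates
sample_var = sum((t - sample_mean) ** 2 for t in draws) / (replicates - 1)
# Fraction of T_n values inside (-1.96, 1.96); close to 0.95 if T_n is roughly N(0, 1).
coverage = sum(abs(t) < 1.96 for t in draws) / replicates
```

Although the exponential distribution is strongly skewed, the standardized means at this sample size already behave much like standard normal draws; repeating the experiment with a smaller $n$ would show more visible skewness.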


1.7 MARKOV CHAINS 


We offer here a brief introduction to univariate, discrete-time, discrete-state-space 
Markov chains. We will use Markov chains in Chapters 7 and 8. A thorough introduc- 
tion to Markov chains is given in [556], and higher-level study is provided in [462, 
543]. 

Consider a sequence of random variables $\{X^{(t)}\}$, $t = 0, 1, \ldots$, where each $X^{(t)}$ may equal one of a finite or countably infinite number of possible values, called states. The notation $X^{(t)} = j$ indicates that the process is in state $j$ at time $t$. The state space, $S$, is the set of possible values of the random variable $X^{(t)}$.

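A complete probabilistic specification of $X^{(0)}, \ldots, X^{(n)}$ would be to write their joint distribution as the product of conditional distributions of each random variable given its history, or

$$P\left[X^{(0)}, \ldots, X^{(n)}\right] = P\left[X^{(n)} \mid x^{(0)}, \ldots, x^{(n-1)}\right] \times P\left[X^{(n-1)} \mid x^{(0)}, \ldots, x^{(n-2)}\right] \times \cdots \times P\left[X^{(1)} \mid x^{(0)}\right] P\left[X^{(0)}\right]. \tag{1.38}$$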


A simplification of (1.38) is possible under the conditional independence assumption that

$$P\left[X^{(t)} \mid x^{(0)}, \ldots, x^{(t-1)}\right] = P\left[X^{(t)} \mid x^{(t-1)}\right]. \tag{1.39}$$


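Here the next state observed is only dependent upon the present state. This is the Markov property, sometimes called “one-step memory.” In this case,

$$P\left[X^{(0)}, \ldots, X^{(n)}\right] = P\left[X^{(n)} \mid x^{(n-1)}\right] \times P\left[X^{(n-1)} \mid x^{(n-2)}\right] \times \cdots \times P\left[X^{(1)} \mid x^{(0)}\right] P\left[X^{(0)}\right]. \tag{1.40}$$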




TABLE 1.4 San Francisco rain data considered in Example 1.3.

                   Wet Today    Dry Today
Wet Yesterday         418          256
Dry Yesterday         256          884


Let $p_{ij}^{(t)}$ be the probability that the observed state changes from state $i$ at time $t$ to state $j$ at time $t + 1$. The sequence $\{X^{(t)}\}$, $t = 0, 1, \ldots$ is a Markov chain if

$$p_{ij}^{(t)} = P\left[X^{(t+1)} = j \mid X^{(0)} = x^{(0)}, X^{(1)} = x^{(1)}, \ldots, X^{(t)} = i\right] = P\left[X^{(t+1)} = j \mid X^{(t)} = i\right] \tag{1.41}$$

for all $t = 0, 1, \ldots$ and $x^{(0)}, x^{(1)}, \ldots, x^{(t-1)}, i, j \in S$. The quantity $p_{ij}^{(t)}$ is called the one-step transition probability. If none of the one-step transition probabilities change with $t$, then the chain is called time homogeneous, and $p_{ij}^{(t)} = p_{ij}$. If any of the one-step transition probabilities change with $t$, then the chain is called time-inhomogeneous.

A Markov chain is governed by a transition probability matrix. Suppose that 
the $s$ states in $S$ are, without loss of generality, all integer valued. Then $P$ denotes the $s \times s$ transition probability matrix of a time-homogeneous chain, and the $(i, j)$th element of $P$ is $p_{ij}$. Each element in $P$ must be between zero and one, and each row of the
matrix must sum to one. 


Example 1.3 (San Francisco Weather) Let us consider daily precipitation out- 
comes in San Francisco. Table 1.4 gives the rainfall status for 1814 pairs of consecutive 
days [488]. The data are taken from the months of November through March, starting 
in November of 1990 and ending in March of 2002. These months are when San 
Francisco receives over 80% of its precipitation each year, virtually all in the form of 
rain. We consider a binary classification of each day. A day is considered to be wet if 
more than 0.01 inch of precipitation is recorded and dry otherwise. Thus, S has two 
elements: “wet” and “dry.” The random variable corresponding to the state for the $t$th day is $X^{(t)}$.

Assuming time homogeneity, an estimated transition probability matrix for $X^{(t)}$ would be

$$\hat{P} = \begin{bmatrix} 0.620 & 0.380 \\ 0.224 & 0.775 \end{bmatrix}. \tag{1.42}$$


Clearly, wet and dry weather states are not independent in San Francisco, as a wet 
day is more likely to be followed by a wet day and pairs of dry days are highly 
likely. 
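
The estimate in (1.42) can be reproduced by dividing each row of Table 1.4 by its row total. The sketch below shows this under the assumption that rows index yesterday's state and columns index today's state; the dictionary layout and the function name `estimate_transition_matrix` are our own illustrative choices.

```python
# Counts from Table 1.4: outer key = yesterday's state, inner key = today's state.
counts = {
    "wet": {"wet": 418, "dry": 256},
    "dry": {"wet": 256, "dry": 884},
}


def estimate_transition_matrix(counts):
    """Row-normalize transition counts to estimate one-step transition probabilities."""
    p_hat = {}
    for previous, row in counts.items():
        total = sum(row.values())
        p_hat[previous] = {current: c / total for current, c in row.items()}
    return p_hat


p_hat = estimate_transition_matrix(counts)
for previous, row in p_hat.items():
    print(previous, {state: round(p, 4) for state, p in row.items()})
```

Each estimated row sums to one by construction, as required of a transition probability matrix.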

The limiting theory of Markov chains is important for many of the methods
discussed in this book. We now review some basic results. 

A state to which the chain returns with probability 1 is called a recurrent state. 
A state for which the expected time until recurrence is finite is called nonnull. For 
finite state spaces, recurrent states are nonnull. 

A Markov chain is irreducible if any state $j$ can be reached from any state $i$ in a finite number of steps for all $i$ and $j$. In other words, for each $i$ and $j$ there must exist $m > 0$ such that $P[X^{(m+n)} = i \mid X^{(n)} = j] > 0$. A Markov chain is periodic if it can
visit certain portions of the state space only at certain regularly spaced intervals. State 
j has period d if the probability of going from state j to state j in n steps is 0 for all 
n not divisible by d. If every state in a Markov chain has period 1, then the chain is 
called aperiodic. A Markov chain is ergodic if it is irreducible, aperiodic, and all its 
states are nonnull and recurrent. 

Let $\pi$ denote a vector of probabilities that sum to one, with $i$th element $\pi_i$ denoting the marginal probability that $X^{(t)} = i$. Then the marginal distribution of $X^{(t+1)}$ must be $\pi^T P$. Any discrete probability distribution $\pi$ such that $\pi^T P = \pi^T$ is called a stationary distribution for $P$, or for the Markov chain having transition probability matrix $P$. If $X^{(t)}$ follows a stationary distribution, then the marginal distributions of $X^{(t)}$ and $X^{(t+1)}$ are identical.

If a time-homogeneous Markov chain satisfies 


$$\pi_i p_{ij} = \pi_j p_{ji} \tag{1.43}$$

for all $i, j \in S$, then $\pi$ is a stationary distribution for the chain, and the chain is called
reversible because the joint distribution of a sequence of observations is the same 
whether the chain is run forwards or backwards. Equation (1.43) is called the detailed 
balance condition. 
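
To make stationarity and detailed balance concrete, the following sketch finds a stationary distribution for the two-state San Francisco chain by repeatedly applying $\pi^T P$ to a starting distribution, then checks (1.43) numerically. Repeated multiplication is only one of several ways to compute a stationary distribution, and it relies on the chain being irreducible and aperiodic; the function names `step` and `stationary` are hypothetical.

```python
# Estimated transition matrix from Table 1.4 (state 0 = wet, state 1 = dry).
P = [[418 / 674, 256 / 674],
     [256 / 1140, 884 / 1140]]


def step(pi, P):
    """Return pi^T P, the marginal distribution one time step later."""
    s = len(P)
    return [sum(pi[i] * P[i][j] for i in range(s)) for j in range(s)]


def stationary(P, tol=1e-12, max_iter=10_000):
    """Iterate pi^T P from a uniform start until the distribution stops changing."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(max_iter):
        new = step(pi, P)
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("distribution did not stabilize within max_iter steps")


pi = stationary(P)
print(pi)                                # stationary probabilities of wet and dry
print(pi[0] * P[0][1], pi[1] * P[1][0])  # the two sides of detailed balance (1.43)
```

For this chain the two printed detailed-balance quantities agree, which is expected because any two-state chain with a stationary distribution satisfies (1.43).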

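If a Markov chain with transition probability matrix $P$ and stationary distribution $\pi$ is irreducible and aperiodic, then $\pi$ is unique and

$$\lim_{n\to\infty} P\left[X^{(t+n)} = j \mid X^{(t)} = i\right] = \pi_j, \tag{1.44}$$

where $\pi_j$ is the $j$th element of $\pi$. The $\pi_j$ are the solutions of the following set of equations:

$$\pi_j \ge 0, \quad \sum_{i \in S} \pi_i = 1, \quad \text{and} \quad \pi_j = \sum_{i \in S} \pi_i p_{ij} \quad \text{for each } j \in S. \tag{1.45}$$
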
We can restate and extend (1.44) as follows. If $X^{(1)}, X^{(2)}, \ldots$ are realizations from an irreducible and aperiodic Markov chain with stationary distribution $\pi$, then $X^{(n)}$ converges in distribution to the distribution given by $\pi$, and for any function $h$,

$$\frac{1}{n} \sum_{t=1}^{n} h\left(X^{(t)}\right) \to E_\pi\{h(X)\} \tag{1.46}$$

almost surely as $n \to \infty$, provided $E_\pi\{|h(X)|\}$ exists [605]. This is one form of the
ergodic theorem, which is a generalization of the strong law of large numbers. 
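
The ergodic theorem can be illustrated by simulating a long path of the San Francisco chain and averaging an indicator function along it. In the sketch below, $h$ is the indicator of a wet day, so the long-run average should approach the stationary probability of a wet day. The path length, seed, starting state, and function name `simulate_chain` are illustrative assumptions.

```python
import random

# Estimated transition probabilities from Table 1.4.
P = {
    "wet": {"wet": 418 / 674, "dry": 256 / 674},
    "dry": {"wet": 256 / 1140, "dry": 884 / 1140},
}


def simulate_chain(P, start, n, seed=0):
    """Simulate n steps of a time-homogeneous chain with transition probabilities P."""
    rng = random.Random(seed)
    state = start
    path = []
    for _ in range(n):
        states = list(P[state])
        weights = [P[state][s] for s in states]
        state = rng.choices(states, weights=weights)[0]
        path.append(state)
    return path


path = simulate_chain(P, start="dry", n=100_000)
wet_fraction = sum(s == "wet" for s in path) / len(path)
# Compare the path average with the stationary probability 674/1814 of a wet day.
print(wet_fraction, 674 / 1814)
```

Consecutive states are dependent, so the average converges more slowly than it would for an i.i.d. sample of the same size.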

We have considered here only Markov chains for discrete state spaces. In
Chapters 7 and 8 we will apply these ideas to continuous state spaces. The prin- 
ciples and results for continuous state spaces and multivariate random variables are 
similar to the simple results given here. 


1.8 COMPUTING 


If you are new to computer programming, or wishing to learn a new language, there is 
no better time to start than now. Our preferred language for teaching and learning about 
statistical computing is R (freely available at www.r-project.org), but we avoid any 
language-specific limitations in this text. Most of the methods described in this book 
can also be easily implemented in other high-level computer languages for mathemat- 
ics and statistics such as MATLAB. Programming in Java and low-level languages 
such as C++ and FORTRAN is also possible. The tradeoff between implementation 
ease for high-level languages and computation speed for low-level languages often 
guides this selection. Links to these and other useful software packages, including 
libraries of code for some of the methods described in this book, are available on the 
book website. 

Ideally, your computer programming background includes a basic understand- 
ing of computer arithmetic: how real numbers and mathematical operations are 
implemented in the binary world of a computer. We focus on higher-level issues 
in this book, but the most meticulous implementation of the algorithms we 
describe can require consideration of the vagaries of computer arithmetic, or use 
of available routines that competently deal with such issues. Interested readers may 
refer to [383]. 


PART I 
OPTIMIZATION 


In statistics we need to optimize many functions, including likelihood functions and generalizations thereof, Bayesian posterior distributions, entropy, and fitness landscapes. These all describe the information content in some observed data. Maximizing these functions can drive inference.

How to maximize a function depends on several criteria including the 
nature of the function and what is practical. You could arbitrarily choose values
to input to your function to eventually find a very good choice, or you can 
do a more guided search. Optimization procedures help guide search efforts, 
some employing more mathematical theory and others using more heuristic 
principles. Options include methods that rely on derivatives, derivative-free 
approaches, and heuristic strategies. In the next three chapters, we describe 
some of the statistical contexts within which optimization problems arise and 
a variety of methods for solving them. 

In Chapter 2 we consider fundamental methods for optimizing smooth 
nonlinear equations. Such methods are applicable to continuous-valued func- 
tions, as when finding the maximum likelihood estimate of a continuous func- 
tion. In Chapter 3 we consider a variety of strategies for combinatorial opti- 
mization. These algorithms address problems where the functions are discrete 
and usually formidably complex, such as finding the optimal set of predictors 
from a large set of potential explanatory variables in multiple regression analysis. The methods in Chapters 2 and 3 originate from mathematics and computer science but are used widely in statistics. The expectation–maximization (EM) algorithm in Chapter 4 is focused on a problem that is frequently encountered in statistical inference: How do you maximize a likelihood function
when some of the data are missing? It turns out that this powerful algorithm 
can be used to solve many other statistical problems as well. 

CHAPTER 2


OPTIMIZATION AND SOLVING 
NONLINEAR EQUATIONS 


Maximum likelihood estimation is central to statistical inference. Long hours can 
be invested in learning about the theoretical performance of MLEs and their analytic 
derivation. Faced with a complex likelihood lacking analytic solution, however, many 
people are unsure how to proceed. 

Most functions cannot be optimized analytically. For example, maximizing 
g(x) = (log x)/(1 + x) with respect to x by setting the derivative equal to zero and 
solving for x leads to an algebraic impasse because $1 + 1/x - \log x = 0$ has no
analytic solution. Many realistic statistical models induce likelihoods that cannot 
be optimized analytically—indeed, we would argue that greater realism is strongly 
associated with reduced ability to find optima analytically. 

Statisticians face other optimization tasks, too, aside from maximum likelihood. 
Minimizing risk in a Bayesian decision problem, solving nonlinear least squares prob- 
lems, finding highest posterior density intervals for many distributions, and a wide 
variety of other tasks all involve optimization. Such diverse tasks are all versions of 
the following generic problem: Optimize a real-valued function g with respect to its 
argument, a p-dimensional vector x. In this chapter, we will limit consideration to 
g that are smooth and differentiable with respect to x; in Chapter 3 we discuss opti- 
mization when g is defined over a discrete domain. There is no meaningful distinction 
between maximization and minimization, since maximizing a function is equivalent 
to minimizing its negative. As a convention, we will generally use language suggestive 
of seeking a maximum. 

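For maximum likelihood estimation, $g$ is the log likelihood function $l$, and $x$ is the corresponding parameter vector $\theta$. If $\hat{\theta}$ is an MLE, it maximizes the log likelihood. Therefore $\hat{\theta}$ is a solution to the score equation

$$l'(\theta) = 0, \tag{2.1}$$

where

$$l'(\theta) = \left(\frac{\partial l(\theta)}{\partial \theta_1}, \ldots, \frac{\partial l(\theta)}{\partial \theta_p}\right)^T$$

and $0$ is a column vector of zeros.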

Immediately, we see that optimization is intimately linked with solving nonlinear equations. Indeed, one could reinterpret this chapter as an introduction to root finding rather than optimization. Finding an MLE amounts to finding a root of the score equation. The maximum of $g$ is a solution to $g'(x) = 0$. (Conversely, one may also turn a univariate root-finding exercise into an optimization problem by minimizing $|g'(x)|$ with respect to $x$, where $g'$ is the function whose root is sought.)

The solution of g’(x) = 0 is most difficult when this set of equations has a 
solution that cannot be determined analytically. In this case, the equations will be 
nonlinear. Solving linear equations is easy; however, there is another class of difficult
optimization problems where the objective function itself is linear and there are linear 
inequality constraints. Such problems can be solved using linear programming tech- 
niques such as the simplex method [133, 198, 247, 497] and interior point methods 
[347, 362, 552]. Such methods are not covered in this book. 

For smooth, nonlinear functions, optima are routinely found using a variety 
of off-the-shelf numerical optimization software. Many of these programs are very 
effective, raising the question of whether optimization is a solved problem whose 
study here might be a low priority. For example, we have omitted the topic of uniform 
random number generation from this book—despite its importance in the statistical 
computing literature—because of the widespread use of high-quality prepackaged 
routines that accomplish the task. Why, then, should optimization methods be treated 
differently? The difference is that optimization software is confronted with a new 
problem every time the user presents a new function to be optimized. Even the best 
optimization software often initially fails to find the maximum for tricky likelihoods 
and requires tinkering to succeed. Therefore, the user must understand enough about 
how optimization works to tune the procedure successfully. 

We begin by studying univariate optimization. Extensions to multivariate prob- 
lems are described in Section 2.2. Optimization over discrete spaces is covered 
in Chapter 3, and an important special case related to missing data is covered in 
Chapter 4. 

Useful references on optimization methods include [153, 198, 247, 475, 486, 
494]. 


2.1 UNIVARIATE PROBLEMS 


A simple univariate numerical optimization problem that we will discuss throughout 
this section is to maximize 


$$g(x) = \frac{\log x}{1 + x} \tag{2.2}$$


with respect to $x$. Since no analytic solution can be found, we resort to iterative methods that rely on successive approximation of the solution. Graphing $g(x)$ in Figure 2.1, we see that the maximum is around 3. Therefore it might be reasonable to use $x^{(0)} = 3.0$ as an initial guess, or starting value, for an iterative procedure. An updating equation will be used to produce an improved guess, $x^{(t+1)}$, from the most recent value $x^{(t)}$, for $t = 0, 1, 2, \ldots$ until iterations are stopped. The update may be based on an attempt to find the root of

$$g'(x) = \frac{1 + 1/x - \log x}{(1 + x)^2},$$

or on some other rationale.

FIGURE 2.1 The maximum of $g(x) = (\log x)/(1 + x)$ occurs at $x^* \approx 3.59112$, indicated by the vertical line.

The bisection method illustrates the main components of iterative root-finding procedures. If $g'$ is continuous on $[a_0, b_0]$ and $g'(a_0) g'(b_0) \le 0$, then the intermediate value theorem [562] implies that there exists at least one $x^* \in [a_0, b_0]$ for which $g'(x^*) = 0$ and hence $x^*$ is a local optimum of $g$. To find it, the bisection method systematically shrinks the interval from $[a_0, b_0]$ to $[a_1, b_1]$ to $[a_2, b_2]$ and so on, where $[a_0, b_0] \supset [a_1, b_1] \supset [a_2, b_2] \supset \cdots$.

Let $x^{(0)} = (a_0 + b_0)/2$ be the starting value. The updating equations are

$$[a_{t+1}, b_{t+1}] = \begin{cases} [a_t, x^{(t)}] & \text{if } g'(a_t)\, g'(x^{(t)}) \le 0, \\ [x^{(t)}, b_t] & \text{if } g'(a_t)\, g'(x^{(t)}) > 0 \end{cases} \tag{2.3}$$

and

$$x^{(t+1)} = \frac{1}{2}\left(a_{t+1} + b_{t+1}\right). \tag{2.4}$$

If $g'$ has more than one root in the starting interval, it is easy to see that bisection will find one of them, but will not find the rest.


Example 2.1 (A Simple Univariate Optimization) To find the value of x 
maximizing (2.2), we might take $a_0 = 1$, $b_0 = 5$, and $x^{(0)} = 3$. Figure 2.2 illustrates
the first few steps of the bisection algorithm for this simple function. 
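
A minimal sketch of the bisection updates (2.3) and (2.4) for this example follows. The function names, the tolerance, and the use of an absolute-change stopping rule are our own assumptions for illustration; the midpoint is computed as `a + (b - a) / 2`, which is one common way to code it.

```python
import math


def g_prime(x):
    """Derivative of g(x) = log(x) / (1 + x)."""
    return (1 + 1 / x - math.log(x)) / (1 + x) ** 2


def bisection(f, a, b, tol=1e-6, max_iter=200):
    """Find a root of f in [a, b] by bisection; returns (root estimate, iterations)."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must not have the same sign")
    x = a + (b - a) / 2
    for t in range(max_iter):
        # Keep the half of the interval on which f changes sign, as in (2.3).
        if f(a) * f(x) <= 0:
            b = x
        else:
            a = x
        x_new = a + (b - a) / 2  # midpoint of the new interval, as in (2.4)
        if abs(x_new - x) < tol:
            return x_new, t + 1
        x = x_new
    raise RuntimeError("bisection did not converge within max_iter iterations")


root, iterations = bisection(g_prime, 1.0, 5.0)
print(root, iterations)
```

The interval is halved at every iteration regardless of the shape of `f`, which is why the iteration count depends only on the starting interval and the tolerance.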

FIGURE 2.2 Illustration of the bisection method from Example 2.1. The top portion of this graph shows $g'(x)$ and its root at $x^*$. The bottom portion shows the first three intervals obtained using the bisection method with $(a_0, b_0) = (1, 5)$. The $t$th estimate of the root is at the center of the $t$th interval.


Suppose the true maximum of $g(x)$ with respect to $x$ is achieved at $x^*$. The updating equation of any iterative procedure should be designed to encourage $x^{(t)} \to x^*$ as $t$ increases. Of course there is no guarantee that $x^{(t)}$ will converge to anything, let alone to $x^*$.

In practice, we cannot allow the procedure to run indefinitely, so we require a 
stopping rule, based on some convergence criteria, to trigger an end to the successive 
approximation. At each iteration, the stopping rule should be checked. When the 
convergence criteria are met, the new $x^{(t+1)}$ is taken as the solution. There are two
reasons to stop: if the procedure appears to have achieved satisfactory convergence 
or if it appears unlikely to do so soon. 

It is tempting to monitor convergence by tracking the proximity of $g'(x^{(t+1)})$ to zero. However, large changes from $x^{(t)}$ to $x^{(t+1)}$ can occur even when $g'(x^{(t+1)})$ is very small; therefore a stopping rule based directly on $g'(x^{(t+1)})$ is not very reliable. On the other hand, a small change from $x^{(t)}$ to $x^{(t+1)}$ is most frequently associated with $g'(x^{(t+1)})$ near zero. Therefore, we typically assess convergence by monitoring $|x^{(t+1)} - x^{(t)}|$ and use $g'(x^{(t+1)})$ as a backup check.

The absolute convergence criterion mandates stopping when

$$\left|x^{(t+1)} - x^{(t)}\right| < \epsilon, \tag{2.5}$$

where $\epsilon$ is a constant chosen to indicate tolerable imprecision. For bisection, it is easy to confirm that

$$b_t - a_t = 2^{-t}(b_0 - a_0). \tag{2.6}$$

A true error tolerance of $|x^{(t)} - x^*| < \delta$ is achieved when $2^{-(t+1)}(b_0 - a_0) < \delta$, which occurs once $t > \log_2\{(b_0 - a_0)/\delta\} - 1$. Reducing $\delta$ tenfold therefore requires an increase in $t$ of $\log_2 10 \approx 3.3$. Hence, three or four iterations are needed to achieve each extra decimal place of precision.

The relative convergence criterion mandates stopping when iterations have reached a point for which

$$\frac{\left|x^{(t+1)} - x^{(t)}\right|}{\left|x^{(t)}\right|} < \epsilon. \tag{2.7}$$

This criterion enables the specification of a target precision (e.g., within 1%) without 
worrying about the units of x. 

Preference between the absolute and relative convergence criteria depends on the problem at hand. If the scale of $x$ is huge (or tiny) relative to $\epsilon$, an absolute convergence criterion may stop iterations too reluctantly (or too soon). The relative convergence criterion corrects for the scale of $x$, but can become unstable if $x^{(t)}$ values (or the true solution) lie too close to zero. In this latter case, another option is to monitor relative convergence by stopping when

$$\frac{\left|x^{(t+1)} - x^{(t)}\right|}{\left|x^{(t)}\right| + \epsilon} < \epsilon.$$
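
The three convergence measures discussed above can be written as small helper functions. This is only a sketch: the function names are hypothetical, and in practice the measure and the tolerance $\epsilon$ should be chosen with the scale of the problem in mind.

```python
def absolute_change(x_new, x_old):
    """Left-hand side of the absolute convergence criterion (2.5)."""
    return abs(x_new - x_old)


def relative_change(x_new, x_old):
    """Left-hand side of the relative convergence criterion (2.7).

    Unstable (or undefined) when x_old is at or very near zero.
    """
    return abs(x_new - x_old) / abs(x_old)


def modified_relative_change(x_new, x_old, eps):
    """Relative change with eps added to the denominator to guard against x_old near zero."""
    return abs(x_new - x_old) / (abs(x_old) + eps)


eps = 1e-3
# The same absolute step of 1.0 is negligible on one scale and large on another.
print(relative_change(1_000_001.0, 1_000_000.0) < eps)  # tiny relative to x
print(relative_change(2.0, 1.0) < eps)                  # large relative to x
# Near zero the plain relative measure blows up; the modified one stays bounded.
print(modified_relative_change(1e-9, 0.0, eps) < eps)
```

A stopping rule would typically compare one of these measures with $\epsilon$ at every iteration and combine it with an iteration cap.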


Bisection works when $g'$ is continuous. Taking limits on both sides of (2.6) implies $\lim_{t\to\infty} a_t = \lim_{t\to\infty} b_t$; therefore the bisection method converges to some point $x^{(\infty)}$. The method always ensures that $g'(a_t) g'(b_t) \le 0$; continuity therefore implies that $g'(x^{(\infty)})^2 \le 0$. Thus $g'(x^{(\infty)})$ must equal zero, which proves that $x^{(\infty)}$ is a root of $g'$. In other words, the bisection method is, in theory, guaranteed to converge to a root in $[a_0, b_0]$.

In practice, numerical imprecision in a computer may thwart convergence. For most iterative approximation methods, it is safer to add a small correction to a previous approximation than to initiate a new approximation from scratch. The bisection method is more stable numerically when the updated endpoint is calculated as, say, $a_{t+1} = a_t + (b_t - a_t)/2$ instead of $a_{t+1} = (a_t + b_t)/2$. Yet, even carefully coded algorithms can fail, and optimization procedures more sophisticated than bisection can fail for all sorts of reasons. It is also worth noting that there are pathological circumstances where the MLE is not a solution of the score equation or the MLE is not unique; see [127] for examples.

Given such anomalies, it is important to include stopping rules that flag a failure to converge. The simplest such stopping rule is to stop after $N$ iterations, regardless of convergence. It may also be wise to stop if one or more convergence measures like $|x^{(t+1)} - x^{(t)}|$, $|x^{(t+1)} - x^{(t)}| / |x^{(t)}|$, or $|g'(x^{(t+1)})|$ either fail to decrease or cycle over several iterations. The solution itself may also cycle unsatisfactorily. It is also sensible to stop if the procedure appears to be converging to a point at which $g(x)$ is inferior to another value you have already found. This prevents wasted effort when a search is converging to a known false peak or local maximum. Regardless of which such stopping rules you employ, any indication of poor convergence behavior means that $x^{(t+1)}$ must be discarded and the procedure somehow restarted in a manner more likely to yield successful convergence.

Starting is as important as stopping. In general, a bad starting value can lead to 
divergence, cycling, discovery of a misleading local maximum or a local minimum, 
or other problems. The outcome depends on g, the starting value, and the optimization 
algorithm tried. In general, it helps to start quite near the global optimum, as long 
as $g$ is not virtually flat in the neighborhood containing $x^{(0)}$ and $x^*$. Methods for
generating a reasonable starting value include graphing, preliminary estimates (e.g., 
method-of-moments estimates), educated guesses, and trial and error. If computing 
speed limits the total number of iterations that you can afford, it is wise not to invest 
them all in one long run of the optimization procedure. Using a collection of runs from 
multiple starting values (see the random starts local search method in Section 3.2) 
can be an effective way to gain confidence in your result and to avoid being fooled 
by local optima or stymied by convergence failure. 

The bisection method is an example of a bracketing method, that is to say, a 
method that bounds a root within a sequence of nested intervals of decreasing length. 
Bisection is quite a slow approach: It requires a rather large number of iterations to 
achieve a desired precision, relative to other methods discussed below. Other brack- 
eting methods include the secant bracket [630], which is equally slow after an initial 
period of greater efficiency, and the Illinois method [348], Ridders’s method [537], 
and Brent’s method [68], which are faster. 

Despite the relative slowness of bracketing methods, they have one significant advantage over the methods described in the remainder of this chapter. If $g'$ is continuous on $[a_0, b_0]$, a root can be found, regardless of the existence, behavior, or ease of deriving $g''$. Because they avoid worries about $g''$ while performing relatively robustly on most problems, bracketing methods continue to be reasonable alternatives to the methods below that rely on greater smoothness of $g$.


2.1.1 Newton’s Method 


An extremely fast root-finding approach is Newton’s method. This approach is also referred to as Newton–Raphson iteration, especially in univariate applications. Suppose that $g'$ is continuously differentiable and that $g''(x^*) \ne 0$. At iteration $t$, the approach approximates $g'(x^*)$ by the linear Taylor series expansion:

$$0 = g'(x^*) \approx g'(x^{(t)}) + (x^* - x^{(t)})\, g''(x^{(t)}). \tag{2.8}$$

Since $g'$ is approximated by its tangent line at $x^{(t)}$, it seems sensible to approximate the root of $g'$ by the root of the tangent line. Thus, solving for $x^*$ above, we obtain

$$x^* \approx x^{(t)} - \frac{g'(x^{(t)})}{g''(x^{(t)})} = x^{(t)} + h^{(t)}. \tag{2.9}$$

This equation describes an approximation to $x^*$ that depends on the current guess $x^{(t)}$ and a refinement $h^{(t)}$. Iterating this strategy yields the updating equation for Newton’s method:

$$x^{(t+1)} = x^{(t)} + h^{(t)}, \tag{2.10}$$

FIGURE 2.3 Illustration of Newton’s method applied in Example 2.2. At the first step, Newton’s method approximates $g'$ by its tangent line at $x^{(0)}$, whose root $x^{(1)}$ serves as the next approximation of the true root $x^*$. The next step similarly yields $x^{(2)}$, which is already quite close to $x^*$.


where $h^{(t)} = -g'(x^{(t)})/g''(x^{(t)})$. The same update can be motivated by analytically solving for the maximum of the quadratic Taylor series approximation to $g(x^*)$, namely $g(x^{(t)}) + (x^* - x^{(t)})\, g'(x^{(t)}) + (x^* - x^{(t)})^2 g''(x^{(t)})/2$. When the optimization of $g$ corresponds to an MLE problem where $\hat{\theta}$ is a solution to $l'(\theta) = 0$, the updating equation for Newton’s method is

$$\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}. \tag{2.11}$$


Example 2.2 (A Simple Univariate Optimization, Continued) Figure 2.3 illus- 
trates the first several iterations of Newton’s method applied to the simple function 
in (2.2). 

The Newton increment for this problem is given by

$$h^{(t)} = \frac{(x^{(t)} + 1)\left(1 + 1/x^{(t)} - \log x^{(t)}\right)}{3 + 4/x^{(t)} + 1/(x^{(t)})^2 - 2 \log x^{(t)}}. \tag{2.12}$$

Starting from $x^{(0)} = 3.0$, Newton’s method quickly finds $x^{(4)} \approx 3.59112$. For comparison, the first five decimal places of $x^*$ are not correctly determined by the bisection method in Example 2.1 until iteration 19.
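
The iteration in this example can be sketched directly from the increment (2.12). The stopping tolerance, the iteration cap, and the function names below are illustrative assumptions; the returned path lets the reader inspect how quickly successive iterates settle.

```python
import math


def newton_increment(x):
    """Newton increment h = -g'(x) / g''(x) for g(x) = log(x) / (1 + x), as in (2.12)."""
    numerator = (x + 1) * (1 + 1 / x - math.log(x))
    denominator = 3 + 4 / x + 1 / x**2 - 2 * math.log(x)
    return numerator / denominator


def newton(x0, tol=1e-10, max_iter=50):
    """Iterate x <- x + h(x) until the absolute change falls below tol."""
    x = x0
    path = [x]
    for _ in range(max_iter):
        x_new = x + newton_increment(x)
        path.append(x_new)
        if abs(x_new - x) < tol:
            return x_new, path
        x = x_new
    raise RuntimeError("Newton's method did not converge within max_iter iterations")


root, path = newton(3.0)
for t, x in enumerate(path):
    print(t, x)
```

Unlike bisection, this iteration has no bracket to fall back on, so a poor starting value can send it away from the root.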


Whether Newton’s method converges depends on the shape of g and the starting 
value. Figure 2.4 illustrates an example where the method diverges from its starting 
value. To better understand what ensures convergence, we must carefully analyze the 
errors at successive steps. 
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x 


FIGURE 2.4 Starting from x, Newton’s method diverges by taking steps that are increas- 
ingly distant from the true root, x*. 


Suppose $g'$ has two continuous derivatives and $g''(x^*) \ne 0$. Since $g''(x^*) \ne 0$ and $g''$ is continuous at $x^*$, there exists a neighborhood of $x^*$ within which $g''(x) \ne 0$ for all $x$. Let us confine interest to this neighborhood, and define $\epsilon^{(t)} = x^{(t)} - x^*$.

A Taylor expansion yields

$$0 = g'(x^*) = g'(x^{(t)}) + (x^* - x^{(t)})\, g''(x^{(t)}) + \frac{1}{2}(x^* - x^{(t)})^2 g'''(q) \tag{2.13}$$

for some $q$ between $x^{(t)}$ and $x^*$. Rearranging terms, we find

$$x^{(t)} + h^{(t)} - x^* = (x^* - x^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}, \tag{2.14}$$

where $h^{(t)}$ is the Newton update increment. Since the left-hand side equals $x^{(t+1)} - x^*$, we conclude

$$\epsilon^{(t+1)} = (\epsilon^{(t)})^2 \frac{g'''(q)}{2 g''(x^{(t)})}. \tag{2.15}$$
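Now, consider a neighborhood of $x^*$, $N_\delta(x^*) = [x^* - \delta, x^* + \delta]$, for $\delta > 0$. Let

$$c(\delta) = \max_{x_1, x_2 \in N_\delta(x^*)} \left| \frac{g'''(x_1)}{2 g''(x_2)} \right|. \tag{2.16}$$

Since

$$c(\delta) \to \left| \frac{g'''(x^*)}{2 g''(x^*)} \right|$$

as $\delta \to 0$, it follows that $\delta c(\delta) \to 0$ as $\delta \to 0$. Let us choose $\delta$ such that $\delta c(\delta) < 1$. If $x^{(t)} \in N_\delta(x^*)$, then (2.15) implies that

$$\left| c(\delta)\, \epsilon^{(t+1)} \right| \le \left( c(\delta)\, \epsilon^{(t)} \right)^2. \tag{2.17}$$

Suppose that the starting value is not too bad, in the sense that $|\epsilon^{(0)}| = |x^{(0)} - x^*| \le \delta$. Then (2.17) implies that

$$\left| \epsilon^{(t)} \right| \le \frac{(c(\delta)\delta)^{2^t}}{c(\delta)}, \tag{2.18}$$

which converges to zero as $t \to \infty$. Hence $x^{(t)} \to x^*$.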

We have just proven the following theorem: If $g'''$ is continuous and $x^*$ is a simple root of $g'$, then there exists a neighborhood of $x^*$ for which Newton’s method converges to $x^*$ when started from any $x^{(0)}$ in that neighborhood.

In fact, when $g'$ is twice continuously differentiable, is convex, and has a root, then Newton’s method converges to the root from any starting point. When starting from somewhere in an interval $[a, b]$, another set of conditions one may check is as follows. If

1. $g''(x) \ne 0$ on $[a, b]$,

2. $g'''(x)$ does not change sign on $[a, b]$,

3. $g'(a) g'(b) < 0$, and

4. $|g'(a)/g''(a)| < b - a$ and $|g'(b)/g''(b)| < b - a$,

then Newton’s method will converge from any $x^{(0)}$ in the interval. Results like these
can be found in many introductory numerical analysis books such as [131, 198, 247, 
376]. A convergence theorem with less stringent conditions is provided by [495]. 


2.1.1.1 Convergence Order The speed of a root-finding approach like Newton’s method is typically measured by its order of convergence. A method has convergence of order $\beta$ if $\lim_{t\to\infty} \epsilon^{(t)} = 0$ and

$$\lim_{t\to\infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^\beta} = c \tag{2.19}$$

for some constants $c \ne 0$ and $\beta > 0$. Higher orders of convergence are better in
the sense that precise approximation of the true solution is more quickly achieved. 
Unfortunately, high orders are sometimes achieved at the expense of robustness: Some 
slow algorithms are more foolproof than their faster counterparts. 

For Newton’s method, (2.15) shows us that

$$\frac{\epsilon^{(t+1)}}{(\epsilon^{(t)})^2} = \frac{g'''(q)}{2 g''(x^{(t)})}. \tag{2.20}$$

If Newton’s method converges, then continuity arguments allow us to note that the right-hand side of this equation converges to $g'''(x^*)/[2 g''(x^*)]$. Thus, Newton’s method has quadratic convergence (i.e., $\beta = 2$), and

$$c = \left| \frac{g'''(x^*)}{2 g''(x^*)} \right|.$$

Quadratic convergence is indeed fast: Usually the number of correct digits in the solution will roughly double with each iteration.

For bisection, the length of the bracketing interval exhibits a property analogous to linear convergence ($\beta = 1$), in that it is halved at each iteration and $\lim_{t\to\infty} |\epsilon^{(t)}| = 0$ if there is a root in the starting interval. However, the distances $x^{(t)} - x^*$ need not shrink at every iteration, and indeed their ratio is potentially unbounded. Thus

$$\lim_{t\to\infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^\beta}$$

may not exist for any $\beta > 0$, and bisection does not formally meet the definition for determining order of convergence.

It is possible to use a foolproof bracketing method such as bisection to safeguard 
a faster but less reliable root-finding approach such as Newton’s method. Instead of 
viewing the bracketing approach as a method to generate steps, view it only as a 
method providing an interval within which a root must lie. If Newton’s method seeks 
a step outside these current bounds, the step must be replaced, curtailed, or (in the 
multivariate case) redirected. Some strategies are mentioned in Section 2.2 and in 
[247]. Safeguarding can reduce the convergence order of a method. 
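
One simple way to safeguard Newton’s method, sketched below, is to maintain a bracketing interval and to replace any Newton step that leaves the interval by a bisection step. This is only one of several possible safeguarding strategies and is not claimed to be the specific scheme of Section 2.2 or [247]; the function name `safeguarded_newton` and the test function based on the arctangent are our own illustrative assumptions. Here `f` plays the role of $g'$ and `fprime` the role of $g''$.

```python
import math


def safeguarded_newton(f, fprime, a, b, tol=1e-10, max_iter=100):
    """Find a root of f in [a, b] using Newton steps, falling back to bisection."""
    fa = f(a)
    if fa * f(b) > 0:
        raise ValueError("f(a) and f(b) must not have the same sign")
    x = a + (b - a) / 2
    for _ in range(max_iter):
        fx = f(x)
        # Shrink the bracket so that it still contains a sign change.
        if fa * fx <= 0:
            b = x
        else:
            a, fa = x, fx
        slope = fprime(x)
        x_new = x - fx / slope if slope != 0 else None
        # Replace the Newton step by the bracket midpoint if it leaves [a, b].
        if x_new is None or not (a <= x_new <= b):
            x_new = a + (b - a) / 2
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    raise RuntimeError("no convergence within max_iter iterations")


# Unsafeguarded Newton iteration on atan(x) diverges from starting values this far from 0.
root = safeguarded_newton(math.atan, lambda x: 1 / (1 + x * x), -3.0, 10.0)
print(root)
```

In this run the first Newton step lands far outside the bracket and is replaced by a bisection step, after which the Newton steps stay inside the bracket and converge rapidly.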


2.1.2 Fisher Scoring 


Recall from Section 1.4 that $I(\theta)$ can be approximated by $-l''(\theta)$. Therefore when the optimization of $g$ corresponds to an MLE problem, it is reasonable to replace $-l''(\theta)$ in the Newton update with $I(\theta)$. This yields an updating increment of $h^{(t)} = l'(\theta^{(t)})/I(\theta^{(t)})$ where $I(\theta^{(t)})$ is the expected Fisher information evaluated at $\theta^{(t)}$. The updating equation is therefore

$$\theta^{(t+1)} = \theta^{(t)} + l'(\theta^{(t)})\, I(\theta^{(t)})^{-1}. \tag{2.21}$$


This approach is called Fisher scoring. 

Fisher scoring and Newton’s method both have the same asymptotic properties, 
but for individual problems one may be computationally or analytically easier than 
the other. Generally, Fisher scoring works better in the beginning to make rapid 
improvements, while Newton’s method works better for refinement near the end. 
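
As a small illustration of (2.21), consider estimating the location $\theta$ of a Cauchy distribution with known scale 1, for which the score is $l'(\theta) = \sum_i 2(x_i - \theta)/\{1 + (x_i - \theta)^2\}$ and the expected Fisher information is $I(\theta) = n/2$. The model, the made-up symmetric data set, the starting value, and the function names below are all assumptions chosen for illustration; they do not come from the text.

```python
def cauchy_score(theta, data):
    """Score l'(theta) for i.i.d. Cauchy(theta, 1) observations."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)


def fisher_scoring(data, theta0, tol=1e-10, max_iter=200):
    """Iterate theta <- theta + l'(theta) / I(theta), with I(theta) = n / 2."""
    info = len(data) / 2.0  # expected Fisher information; constant in theta here
    theta = theta0
    for t in range(max_iter):
        theta_new = theta + cauchy_score(theta, data) / info
        if abs(theta_new - theta) < tol:
            return theta_new, t + 1
        theta = theta_new
    raise RuntimeError("Fisher scoring did not converge within max_iter iterations")


data = [-3.0, -1.0, 0.0, 1.0, 3.0]  # hypothetical data, symmetric about 0
theta_hat, iterations = fisher_scoring(data, theta0=1.0)
print(theta_hat, iterations)
```

Because the expected information does not depend on $\theta$ in this model, each update needs only the score, which is one reason Fisher scoring can be analytically simpler than Newton’s method for some problems.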


2.1.3 Secant Method 


The updating increment for Newton’s method in (2.10) relies on the second derivative, $g''(x^{(t)})$. If calculating this derivative is difficult, it might be replaced by the discrete-difference approximation, $[g'(x^{(t)}) - g'(x^{(t-1)})]/(x^{(t)} - x^{(t-1)})$. The result is the secant method, which has updating equation

$$x^{(t+1)} = x^{(t)} - g'(x^{(t)})\, \frac{x^{(t)} - x^{(t-1)}}{g'(x^{(t)}) - g'(x^{(t-1)})} \tag{2.22}$$

for $t \ge 1$. This approach requires two starting points, $x^{(0)}$ and $x^{(1)}$. Figure 2.5 illustrates the first steps of the method for maximizing the simple function introduced in Example 2.1.

FIGURE 2.5 The secant method locally approximates $g'$ using the secant line between $x^{(0)}$ and $x^{(1)}$. The corresponding estimated root, $x^{(2)}$, is used with $x^{(1)}$ to generate the next approximation.

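Under conditions akin to those for Newton’s method, the secant method will converge to the root $x^*$. To find the order of convergence in this case, restrict attention to a suitably small interval $[a, b]$, containing $x^{(0)}$, $x^{(1)}$, and $x^*$, on which $g''(x) \ne 0$ and $g'''(x) \ne 0$. Letting $\epsilon^{(t+1)} = x^{(t+1)} - x^*$, it is straightforward to show that

$$\epsilon^{(t+1)} = \left[ \frac{x^{(t)} - x^{(t-1)}}{g'(x^{(t)}) - g'(x^{(t-1)})} \right] \left[ \frac{g'(x^{(t)})/\epsilon^{(t)} - g'(x^{(t-1)})/\epsilon^{(t-1)}}{x^{(t)} - x^{(t-1)}} \right] \left[ \epsilon^{(t)} \epsilon^{(t-1)} \right] = A^{(t)} B^{(t)} \epsilon^{(t)} \epsilon^{(t-1)}, \tag{2.23}$$

where $A^{(t)} \to 1/g''(x^*)$ as $x^{(t)} \to x^*$ for continuous $g''$. To deduce a limit for $B^{(t)}$, expand $g'$ in a Taylor series about $x^*$:

$$g'(x^{(t)}) \approx g'(x^*) + (x^{(t)} - x^*)\, g''(x^*) + \frac{1}{2}(x^{(t)} - x^*)^2 g'''(x^*), \tag{2.24}$$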
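
so

$$\frac{g'(x^{(t)})}{\epsilon^{(t)}} \approx g''(x^*) + \frac{\epsilon^{(t)} g'''(x^*)}{2}. \tag{2.25}$$

Similarly, $g'(x^{(t-1)})/\epsilon^{(t-1)} \approx g''(x^*) + \epsilon^{(t-1)} g'''(x^*)/2$. Thus,

$$B^{(t)} \approx g'''(x^*)\, \frac{\epsilon^{(t)} - \epsilon^{(t-1)}}{2\left(x^{(t)} - x^{(t-1)}\right)} = \frac{g'''(x^*)}{2}, \tag{2.26}$$

and a careful examination of the errors shows this approximation to be exact as $x^{(t)} \to x^*$. Thus,

$$\epsilon^{(t+1)} \approx d^{(t)} \epsilon^{(t)} \epsilon^{(t-1)}, \tag{2.27}$$

where

$$d^{(t)} \to \frac{g'''(x^*)}{2 g''(x^*)} = d \quad \text{as } t \to \infty.$$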

To find the order of convergence for the secant method, we must find the $\beta$ for
which

$$\lim_{t \to \infty} \frac{|\epsilon^{(t+1)}|}{|\epsilon^{(t)}|^{\beta}} = c$$

for some constant $c$. Suppose that this relationship does indeed hold, and use this
proportionality expression to replace $\epsilon^{(t-1)}$ and $\epsilon^{(t+1)}$ in (2.27), leaving only terms in
$\epsilon^{(t)}$. Then, after some rearrangement of terms, it suffices to find the $\beta$ for which

$$\lim_{t \to \infty} |\epsilon^{(t)}|^{1 - \beta + 1/\beta} = \frac{c^{1 + 1/\beta}}{d}. \tag{2.28}$$

The right-hand side of (2.28) is a positive constant. Therefore $1 - \beta + 1/\beta = 0$. The
solution is $\beta = (1 + \sqrt{5})/2 \approx 1.62$. Thus, the secant method has a slower order of
convergence than Newton’s method.
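The rearrangement behind (2.28) can be spelled out as follows. This is a sketch that treats the limiting relations as approximate equalities for large $t$ and writes $|d|$ for the magnitude of the constant in (2.27).

```latex
\begin{align*}
|\epsilon^{(t+1)}| &\approx c\,|\epsilon^{(t)}|^{\beta},
\qquad
|\epsilon^{(t)}| \approx c\,|\epsilon^{(t-1)}|^{\beta}
\;\Longrightarrow\;
|\epsilon^{(t-1)}| \approx c^{-1/\beta}\,|\epsilon^{(t)}|^{1/\beta}, \\
c\,|\epsilon^{(t)}|^{\beta}
&\approx |d|\,|\epsilon^{(t)}|\;c^{-1/\beta}\,|\epsilon^{(t)}|^{1/\beta}
\qquad \text{(substituting into (2.27))}, \\
|\epsilon^{(t)}|^{\,1-\beta+1/\beta}
&\approx \frac{c^{\,1+1/\beta}}{|d|}.
\end{align*}
```

Since $\epsilon^{(t)} \to 0$, the left-hand side can tend to a finite positive limit only if the exponent is zero, that is, $\beta^2 - \beta - 1 = 0$, whose positive root is $(1 + \sqrt{5})/2$.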


2.1.4 Fixed-Point Iteration 


A fixed-point of a function is a point whose evaluation by that function equals itself. 
The fixed-point strategy for finding roots is to determine a function G for which 
$g'(x) = 0$ if and only if $G(x) = x$. This transforms the problem of finding a root of
$g'$ into a problem of finding a fixed point of $G$. Then the simplest way to hunt for a
fixed point is to use the updating equation $x^{(t+1)} = G(x^{(t)})$.

Any suitable function $G$ may be tried, but the most obvious choice is $G(x) = g'(x) + x$. This yields the updating equation

$$x^{(t+1)} = x^{(t)} + g'(x^{(t)}). \tag{2.29}$$

The convergence of this algorithm depends on whether $G$ is contractive. To be
contractive on $[a, b]$, $G$ must satisfy
1. $G(x) \in [a, b]$ whenever $x \in [a, b]$, and
2. $|G(x_1) - G(x_2)| \le \lambda |x_1 - x_2|$ for all $x_1, x_2 \in [a, b]$ for some $\lambda \in [0, 1)$.
The interval $[a, b]$ may be unbounded. The second requirement is a Lipschitz condition, and $\lambda$ is called the Lipschitz constant. If $G$ is contractive on $[a, b]$, then there exists
a unique fixed point $x^*$ in this interval, and the fixed-point algorithm will converge to
it from any starting point in the interval. Furthermore, under the conditions above,

$$|x^{(t)} - x^*| \le \frac{\lambda^t}{1 - \lambda}\, |x^{(1)} - x^{(0)}|. \tag{2.30}$$


Proof of a contractive mapping theorem like this can be found in [6, 521]. 
Fixed-point iteration is sometimes called functional iteration. Note that both 
Newton’s method and the secant method are special cases of fixed-point iteration. 


2.1.4.1 Scaling If fixed-point iteration converges, the order of convergence depends on $\lambda$. Convergence is not universally assured. In particular, the Lipschitz condition holds if $|G'(x)| \le \lambda < 1$ for all $x$ in $[a, b]$. If $G(x) = g'(x) + x$, this amounts
to requiring $|g''(x) + 1| < 1$ on $[a, b]$. When $g''$ is bounded and does not change sign
on $[a, b]$, we can rescale nonconvergent problems by choosing $G(x) = \alpha g'(x) + x$
for $\alpha \ne 0$, since $\alpha g'(x) = 0$ if and only if $g'(x) = 0$. To permit convergence, $\alpha$ must
be chosen to satisfy $|\alpha g''(x) + 1| < 1$ on an interval including the starting value.
Although one could carefully calculate a suitable $\alpha$, it may be easier just to try a few
values. If the method converges quickly, then the chosen $\alpha$ was suitable.

Rescaling is only one of several strategies for adjusting G. In general, the 
effectiveness of fixed-point iteration is highly dependent on the chosen form of G. 
For example, consider finding the root of $g'(x) = x + \log x$. Then $G(x) = (x + e^{-x})/2$
converges quickly, whereas $G(x) = e^{-x}$ converges more slowly and $G(x) = -\log x$
fails to converge at all.
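The three choices of $G$ for the root of $x + \log x$ can be compared directly. The sketch below is illustrative only: the starting value 0.5, the tolerance, the iteration cap, and the helper name `fixed_point` are assumptions. A failed run is reported as `None`, which here occurs when an iterate becomes non-positive and the logarithm is undefined.

```python
import math


def fixed_point(G, x0, tol=1e-10, max_iter=200):
    """Iterate x <- G(x); return (x, iterations), or (None, iterations) on failure."""
    x = x0
    for t in range(1, max_iter + 1):
        try:
            x_new = G(x)
        except ValueError:  # e.g. log of a non-positive iterate
            return None, t
        if abs(x_new - x) < tol:
            return x_new, t
        x = x_new
    return None, max_iter


root_a, n_a = fixed_point(lambda x: (x + math.exp(-x)) / 2.0, 0.5)
root_b, n_b = fixed_point(lambda x: math.exp(-x), 0.5)
root_c, n_c = fixed_point(lambda x: -math.log(x), 0.5)
```

The first two runs agree on the root near 0.567, with the first needing fewer iterations, while the third leaves the domain of the logarithm after a few steps.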


Example 2.3 (A Simple Univariate Optimization, Continued) Figure 2.6 illus- 
trates the first several steps of the scaled fixed-point algorithm for maximizing the 
function $g(x) = (\log x)/(1 + x)$ in (2.2) using $G(x) = \alpha g'(x) + x$ with $\alpha = 4$. Note that
line segments whose roots determine the next $x^{(t)}$ are parallel, with slopes equal
to $-1/\alpha$. For this reason, the method is sometimes called the method of parallel
chords. 
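A minimal sketch of the scaled iteration in this example follows. The scaling $\alpha = 4$ comes from the example; the starting value 1.75, the tolerance, and the function names are assumptions.

```python
import math


def g_prime(x):
    """Derivative of g(x) = log(x) / (1 + x)."""
    return (1.0 + 1.0 / x - math.log(x)) / (1.0 + x) ** 2


def scaled_fixed_point(alpha, x0, tol=1e-10, max_iter=5000):
    """Iterate x <- x + alpha * g'(x), i.e. G(x) = alpha * g'(x) + x."""
    x = x0
    for t in range(1, max_iter + 1):
        step = alpha * g_prime(x)
        x += step
        if abs(step) < tol:
            return x, t
    return x, max_iter


x_hat, n_iter = scaled_fixed_point(alpha=4.0, x0=1.75)
```

Near the maximum, $|\alpha g''(x) + 1|$ is only slightly below 1 for this choice of $\alpha$, so convergence is linear and takes a few hundred iterations at this tolerance.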


Suppose an MLE is sought for the parameter of a quadratic log likelihood $l$, or
one that is nearly quadratic near $\hat{\theta}$. Then the score function is locally linear, and $l''$ is
roughly a constant, say $\gamma$. For quadratic log likelihoods, Newton’s method would use
the updating equation $\theta^{(t+1)} = \theta^{(t)} - l'(\theta^{(t)})/\gamma$. If we use scaled fixed-point iteration
with $\alpha = -1/\gamma$, we get the same updating equation. Since many log likelihoods are
approximately locally quadratic, scaled fixed-point iteration can be a very effective
tool. The method is also generally quite stable and easy to code.

FIGURE 2.6 First three steps of scaled fixed-point iteration to maximize $g(x) = (\log x)/(1 + x)$ using $G(x) = \alpha g'(x) + x$ and scaling with $\alpha = 4$, as in Example 2.3.


2.2 MULTIVARIATE PROBLEMS 


In a multivariate optimization problem we seek the optimum of a real-valued function 
$g$ of a $p$-dimensional vector $\mathbf{x} = (x_1, \ldots, x_p)^T$. At iteration $t$, denote the estimated
optimum as $\mathbf{x}^{(t)} = (x_1^{(t)}, \ldots, x_p^{(t)})^T$.

Many of the general principles discussed above for the univariate case also 
apply for multivariate optimization. Algorithms are still iterative. Many algorithms 
take steps based on a local linearization of g’ derived from a Taylor series or se- 
cant approximation. Convergence criteria are similar in spirit despite slight changes 
in form. To construct convergence criteria, let D(u, v) be a distance measure 
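for $p$-dimensional vectors. Two obvious choices are $D(\mathbf{u}, \mathbf{v}) = \sum_{i=1}^{p} |u_i - v_i|$ and
$D(\mathbf{u}, \mathbf{v}) = \sqrt{\sum_{i=1}^{p} (u_i - v_i)^2}$. Then absolute and relative convergence criteria can be
formed from the inequalities

$$D(\mathbf{x}^{(t+1)}, \mathbf{x}^{(t)}) < \epsilon, \qquad \frac{D(\mathbf{x}^{(t+1)}, \mathbf{x}^{(t)})}{D(\mathbf{x}^{(t)}, \mathbf{0})} < \epsilon, \qquad \text{or} \qquad \frac{D(\mathbf{x}^{(t+1)}, \mathbf{x}^{(t)})}{D(\mathbf{x}^{(t)}, \mathbf{0}) + \epsilon} < \epsilon.$$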
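These stopping rules translate directly into code. The sketch below is one possible arrangement; the function names, the `rule` keyword, and the default tolerance are hypothetical choices, not part of the text.

```python
import math


def dist_l1(u, v):
    """Sum of absolute componentwise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def dist_l2(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def converged(x_new, x_old, eps=1e-8, dist=dist_l2, rule="modified"):
    """Check an absolute, relative, or modified relative convergence criterion."""
    change = dist(x_new, x_old)
    if rule == "absolute":
        return change < eps
    scale = dist(x_old, [0.0] * len(x_old))
    if rule == "relative":
        return change / scale < eps  # undefined when x_old is the zero vector
    return change / (scale + eps) < eps  # stays defined when x_old is zero
```

The third form is a convenient default because it behaves like the relative criterion away from the origin but does not divide by zero at it.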


2.2.1 Newton’s Method and Fisher Scoring 


To fashion the Newton’s method update, we again approximate $g(\mathbf{x}^*)$ by the quadratic
Taylor series expansion

$$g(\mathbf{x}^*) = g(\mathbf{x}^{(t)}) + (\mathbf{x}^* - \mathbf{x}^{(t)})^T g'(\mathbf{x}^{(t)}) + \tfrac{1}{2} (\mathbf{x}^* - \mathbf{x}^{(t)})^T g''(\mathbf{x}^{(t)}) (\mathbf{x}^* - \mathbf{x}^{(t)}) \tag{2.31}$$

and maximize this quadratic function with respect to $\mathbf{x}^*$ to find the next iterate. Setting
the gradient of the right-hand side of (2.31) equal to zero yields

$$g'(\mathbf{x}^{(t)}) + g''(\mathbf{x}^{(t)}) (\mathbf{x}^* - \mathbf{x}^{(t)}) = \mathbf{0}. \tag{2.32}$$

This provides the update

$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - g''(\mathbf{x}^{(t)})^{-1} g'(\mathbf{x}^{(t)}). \tag{2.33}$$

FIGURE 2.7 Application of Newton’s method for maximizing a complicated bivariate function, as discussed in Example 2.4. The surface of the function is indicated by shading and
contours, with light shading corresponding to high values. Two runs starting from $\mathbf{x}_a^{(0)}$ and $\mathbf{x}_b^{(0)}$
are shown. These converge to the true maximum and to a local minimum, respectively.


Alternatively, note that the left-hand side of (2.32) is in fact a linear Taylor 
series approximation to g’(x*), and solving (2.32) amounts to finding the root of this 
linear approximation. From either viewpoint, the multivariate Newton increment is 
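$\mathbf{h}^{(t)} = -g''(\mathbf{x}^{(t)})^{-1} g'(\mathbf{x}^{(t)})$.

As in the univariate case, in MLE problems we may replace the observed
information at $\boldsymbol{\theta}^{(t)}$ with $I(\boldsymbol{\theta}^{(t)})$, the expected Fisher information at $\boldsymbol{\theta}^{(t)}$. This yields
the multivariate Fisher scoring approach with update given by

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + I(\boldsymbol{\theta}^{(t)})^{-1}\, l'(\boldsymbol{\theta}^{(t)}). \tag{2.34}$$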


This method is asymptotically equivalent to Newton’s method. 
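The multivariate Newton increment can be illustrated on a small problem. The sketch below uses a hypothetical concave function, $g(x_1, x_2) = -e^{x_1} + x_1 - (x_2 - x_1)^2$, chosen only because its gradient and Hessian are simple and its maximum is at the origin; it is not the function shown in Figure 2.7. The starting point and helper names are likewise assumptions.

```python
import math


def grad(x):
    """Gradient of the illustrative g(x1, x2) = -exp(x1) + x1 - (x2 - x1)**2."""
    x1, x2 = x
    return [-math.exp(x1) + 1.0 + 2.0 * (x2 - x1), -2.0 * (x2 - x1)]


def hess(x):
    """Hessian of the same g; it is negative definite everywhere."""
    return [[-math.exp(x[0]) - 2.0, 2.0], [2.0, -2.0]]


def solve2(A, b):
    """Solve the 2x2 linear system A h = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton(x0, tol=1e-10, max_iter=50):
    """Newton's method: x <- x + h with h = -g''(x)^{-1} g'(x)."""
    x = list(x0)
    for _ in range(max_iter):
        gx = grad(x)
        h = solve2(hess(x), [-gx[0], -gx[1]])
        x = [x[0] + h[0], x[1] + h[1]]
        if math.hypot(h[0], h[1]) < tol:
            break
    return x


x_max = newton([1.0, -1.0])
```

Solving the linear system for the increment, rather than forming the inverse Hessian explicitly, is the usual design choice and carries over to larger $p$.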


Example 2.4 (Bivariate Optimization Problem) Figure 2.7 illustrates the appli- 
cation of Newton’s method to a complicated bivariate function. The surface of the 
function is indicated by shading and contour lines, with high values corresponding to 
light shading. The algorithm is started from two different starting points, $\mathbf{x}_a^{(0)}$ and $\mathbf{x}_b^{(0)}$.
From $\mathbf{x}_a^{(0)}$, the algorithm converges quickly to the true maximum. Note that although
steps were taken in an uphill direction, some step lengths were not ideal. From $\mathbf{x}_b^{(0)}$,
which lies very close to $\mathbf{x}_a^{(0)}$, the algorithm fails to maximize the function; in fact it
converges to a local minimum. One step length in this attempt was so large as to completely overshoot the uphill portion of the ridge, resulting in a step that was downhill.
Near the end, the algorithm steps downhill because it has homed in on the wrong root
of $g'$. In Section 2.2.2, approaches for preventing such problems are discussed.


2.2.1.1 Iteratively Reweighted Least Squares Consider finding the MLEs 
for the parameters of a logistic regression model, which is a well-known type of 
generalized linear model [446]. In a generalized linear model, response variables 
$Y_i$ for $i = 1, \ldots, n$ are independently distributed according to a distribution parameterized by $\theta_i$. Different types of response variables are modeled with different
distributions, but the distribution is always a member of the scaled exponential family. This family has the form $f(y \mid \theta) = \exp\{[y\theta - b(\theta)]/a(\phi) + c(y, \phi)\}$, where $\theta$ is
called the natural or canonical parameter and $\phi$ is the dispersion parameter. Two of the
most useful properties of this family are that $E\{Y\} = b'(\theta)$ and $\operatorname{var}\{Y\} = b''(\theta) a(\phi)$
(see Section 1.3).

The distribution of each Y; is modeled to depend on a corresponding set of 
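observed covariates, $\mathbf{z}_i$. Specifically, we assume that some function of $E\{Y_i \mid \mathbf{z}_i\}$ can
be related to $\mathbf{z}_i$ according to the equation $g(E\{Y_i \mid \mathbf{z}_i\}) = \mathbf{z}_i^T \boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of
parameters and $g$ is called the link function.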

The generalized linear model used for logistic regression is based on the 
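Bernoulli distribution, which is a member of the exponential family. Model the response variables as $Y_i \mid \mathbf{z}_i \sim \text{Bernoulli}(\pi_i)$ independently for $i = 1, \ldots, n$. Suppose
the observed data consist of a single covariate value $z_i$ and a response value $y_i$, for
$i = 1, \ldots, n$. Define the column vectors $\mathbf{z}_i = (1, z_i)^T$ and $\boldsymbol{\beta} = (\beta_0, \beta_1)^T$. Then for the
$i$th observation, the natural parameter is $\theta_i = \log\{\pi_i/(1 - \pi_i)\}$, $a(\phi) = 1$, and $b(\theta_i) = \log\{1 + \exp\{\theta_i\}\} = \log\{1 + \exp\{\mathbf{z}_i^T \boldsymbol{\beta}\}\} = -\log\{1 - \pi_i\}$. The log likelihood is

$$l(\boldsymbol{\beta}) = \mathbf{y}^T \mathbf{Z} \boldsymbol{\beta} - \mathbf{b}^T \mathbf{1}, \tag{2.35}$$

where $\mathbf{1}$ is a column vector of ones, $\mathbf{y} = (y_1 \ldots y_n)^T$, $\mathbf{b} = (b(\theta_1) \ldots b(\theta_n))^T$, and $\mathbf{Z}$
is the $n \times 2$ matrix whose $i$th row is $\mathbf{z}_i^T$.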

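Consider using Newton’s method to find $\boldsymbol{\beta}$ that maximizes this likelihood. The
score function is

$$l'(\boldsymbol{\beta}) = \mathbf{Z}^T (\mathbf{y} - \boldsymbol{\pi}), \tag{2.36}$$

where $\boldsymbol{\pi}$ is a column vector of the Bernoulli probabilities $\pi_1, \ldots, \pi_n$. The Hessian is
given by

$$l''(\boldsymbol{\beta}) = \frac{d}{d\boldsymbol{\beta}} \left(\mathbf{Z}^T (\mathbf{y} - \boldsymbol{\pi})\right) = -\left(\frac{d\boldsymbol{\pi}}{d\boldsymbol{\beta}}\right)^T \mathbf{Z} = -\mathbf{Z}^T \mathbf{W} \mathbf{Z}, \tag{2.37}$$

where $\mathbf{W}$ is a diagonal matrix with $i$th diagonal entry equal to $\pi_i (1 - \pi_i)$.

Newton’s update is therefore

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - l''(\boldsymbol{\beta}^{(t)})^{-1} l'(\boldsymbol{\beta}^{(t)}) \tag{2.38}$$

$$\phantom{\boldsymbol{\beta}^{(t+1)}} = \boldsymbol{\beta}^{(t)} + \left(\mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{Z}\right)^{-1} \left(\mathbf{Z}^T (\mathbf{y} - \boldsymbol{\pi}^{(t)})\right), \tag{2.39}$$

where $\boldsymbol{\pi}^{(t)}$ is the value of $\boldsymbol{\pi}$ corresponding to $\boldsymbol{\beta}^{(t)}$, and $\mathbf{W}^{(t)}$ is the diagonal weight
matrix evaluated at $\boldsymbol{\pi}^{(t)}$.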

Note that the Hessian does not depend on $\mathbf{y}$. Therefore, Fisher’s information matrix is equal to the observed information: $I(\boldsymbol{\beta}) = E\{-l''(\boldsymbol{\beta})\} = E\{\mathbf{Z}^T \mathbf{W} \mathbf{Z}\} = -l''(\boldsymbol{\beta})$.
Therefore, for this example the Fisher scoring approach is the same as Newton’s
method. For generalized linear models, this will always be true when the link function is chosen to make the natural parameter a linear function of the covariates.


Example 2.5 (Human Face Recognition) We will fit a logistic regression model 
to some data related to testing a human face recognition algorithm. Pairs of images 
of faces of 1072 humans were used to train and test an automatic face recognition al- 
gorithm [681]. The experiment used the recognition software to match the first image 
of each person (called a probe) to one of the remaining 2143 images. Ideally, a match 
is made to the other image of the same person (called the target). A successful match 
yielded a response of $y_i = 1$ and a match to any other person yielded a response of
$y_i = 0$. The predictor variable used here is the absolute difference in mean standardized eye region pixel intensity between the probe image and its corresponding target;
this is a measure of whether the two images exhibit similar quality in the important
distinguishing region around the eyes. Large differences in eye region pixel intensity
would be expected to impede recognition (i.e., successful matches). For the data described here, there were 775 correct matches and 297 mismatches. The median and
90th percentile values of the predictor were 0.033 and 0.097, respectively, for image 
pairs successfully matched, and 0.060 and 0.161 for unmatched pairs. Therefore, the 
data appear to support the hypothesis that eye region pixel intensity discrepancies 
impede recognition. These data are available from the website for this book; analyses 
of related datasets are given in [250, 251]. 

To quantify the relationship between these variables, we will fit a logistic regression model. Thus, $z_i$ is the absolute difference in eye region intensity for an image pair
and $y_i$ indicates whether the $i$th probe was successfully matched, for $i = 1, \ldots, 1072$.
The likelihood function is composed as in (2.35), and we will apply Newton’s
method.

To start, we may take $\boldsymbol{\beta}^{(0)} = (\beta_0^{(0)}, \beta_1^{(0)})^T = (0.95913, 0)^T$, which means $\pi_i = 775/1072$ for all $i$ at iteration 0. Table 2.1 shows that the approximation converges
quickly, with $\boldsymbol{\beta}^{(4)} = (1.73874, -13.58840)^T$. Quick convergence is also achieved
from the starting value corresponding to $\pi_i = 0.5$ for all $i$ (namely $\boldsymbol{\beta}^{(0)} = \mathbf{0}$), which
is a suggested rule of thumb for fitting logistic regression models with Bernoulli data
[320]. Since $\hat{\beta}_1 = -13.59$ is nearly 9 marginal standard deviations below zero, these
data strongly support the hypothesis that eye region intensity discrepancies impede
recognition.




TABLE 2.1 Parameter estimates and corresponding variance-covariance matrix estimates are shown for each Newton’s method iteration for fitting a logistic regression model to the face recognition data described in Example 2.5.

Iteration, t    $\boldsymbol{\beta}^{(t)}$ = (β0, β1)^T          $-l''(\boldsymbol{\beta}^{(t)})^{-1}$ (rows separated by ";")
0               (0.95913, 0.00000)             [0.01067, -0.11412; -0.11412, 2.16701]
1               ([illegible], -14.20059)       [0.13312, -0.14010; -0.14010, 2.36367]
2               ([illegible], -13.56988)       [0.01347, -0.13941; -0.13941, 2.32090]
3               ([illegible], -13.58839)       [0.01349, -0.13952; -0.13952, 2.32241]
4               (1.73874, -13.58840)           [0.01349, -0.13952; -0.13952, 2.32241]


The Fisher scoring approach to maximum likelihood estimation for generalized 
linear models is important for several reasons. First, it is an application of the method 
of iteratively reweighted least squares (IRLS). Let 


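$$\mathbf{e}^{(t)} = \mathbf{y} - \boldsymbol{\pi}^{(t)} \tag{2.40}$$

and

$$\mathbf{x}^{(t)} = \mathbf{Z} \boldsymbol{\beta}^{(t)} + (\mathbf{W}^{(t)})^{-1} \mathbf{e}^{(t)}. \tag{2.41}$$

Now the Fisher scoring update can be written as

$$\begin{aligned}
\boldsymbol{\beta}^{(t+1)} &= \boldsymbol{\beta}^{(t)} + (\mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{Z})^{-1} \mathbf{Z}^T \mathbf{e}^{(t)} \\
&= (\mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{Z})^{-1} \left[\mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{Z} \boldsymbol{\beta}^{(t)} + \mathbf{Z}^T \mathbf{W}^{(t)} (\mathbf{W}^{(t)})^{-1} \mathbf{e}^{(t)}\right] \\
&= (\mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{Z})^{-1} \mathbf{Z}^T \mathbf{W}^{(t)} \mathbf{x}^{(t)}.
\end{aligned} \tag{2.42}$$

We call $\mathbf{x}^{(t)}$ the working response because it is apparent from (2.42) that $\boldsymbol{\beta}^{(t+1)}$ are the
regression coefficients resulting from the weighted least squares regression of $\mathbf{x}^{(t)}$ on
$\mathbf{Z}$ with weights corresponding to the diagonal elements of $\mathbf{W}^{(t)}$. At each iteration, a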
new working response and weight vector are calculated, and the update can be fitted 
via weighted least squares. 
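The working-response formulation can be sketched for a single-covariate logistic regression. The eight data points below are hypothetical and are not the face recognition data; the function name, tolerance, and starting value $\boldsymbol{\beta}^{(0)} = \mathbf{0}$ are likewise assumptions. Each pass forms the weights and working response and then solves the $2 \times 2$ weighted normal equations.

```python
import math


def irls_logistic(z, y, beta=(0.0, 0.0), tol=1e-10, max_iter=50):
    """Fit logit(pi_i) = b0 + b1 * z_i by iteratively reweighted least squares."""
    b0, b1 = beta
    for _ in range(max_iter):
        eta = [b0 + b1 * zi for zi in z]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [p * (1.0 - p) for p in pi]  # diagonal of W
        # Working response: Z beta + W^{-1} (y - pi)
        x = [e + (yi - p) / wi for e, yi, p, wi in zip(eta, y, pi, w)]
        # Weighted normal equations (Z'WZ) beta = Z'W x for the 2-parameter case
        s0 = sum(w)
        s1 = sum(wi * zi for wi, zi in zip(w, z))
        s2 = sum(wi * zi * zi for wi, zi in zip(w, z))
        t0 = sum(wi * xi for wi, xi in zip(w, x))
        t1 = sum(wi * zi * xi for wi, zi, xi in zip(w, z, x))
        det = s0 * s2 - s1 * s1
        new_b0 = (s2 * t0 - s1 * t1) / det
        new_b1 = (s0 * t1 - s1 * t0) / det
        done = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if done:
            break
    return b0, b1


# Hypothetical data: success becomes less likely as z increases.
z_obs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
y_obs = [1, 1, 0, 1, 0, 1, 0, 0]
b0_hat, b1_hat = irls_logistic(z_obs, y_obs)
```

At convergence the score $\mathbf{Z}^T(\mathbf{y} - \boldsymbol{\pi})$ is zero to rounding error, which gives a convenient check on any implementation.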

Second, IRLS for generalized linear models is a special case of the Gauss- 
Newton method for nonlinear least squares problems, which is introduced briefly 
below. IRLS therefore shows the same behavior as Gauss—Newton; in particular, it 
can be a slow and unreliable approach to fitting generalized linear models unless the 
model fits the data rather well [630].


2.2.2 Newton-Like Methods


Some very effective methods rely on updating equations of the form 
$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - (\mathbf{M}^{(t)})^{-1} g'(\mathbf{x}^{(t)}), \tag{2.43}$$

where $\mathbf{M}^{(t)}$ is a $p \times p$ matrix approximating the Hessian, $g''(\mathbf{x}^{(t)})$. In general optimization problems, there are several good reasons to consider replacing the Hessian by
some simpler approximation. First, it may be computationally expensive to evaluate
the Hessian. Second, the steps taken by Newton’s method are not necessarily always
uphill: At each iteration, there is no guarantee that $g(\mathbf{x}^{(t+1)}) > g(\mathbf{x}^{(t)})$. A suitable
$\mathbf{M}^{(t)}$ can guarantee ascent. We already know that one possible Hessian replacement,
$\mathbf{M}^{(t)} = -I(\boldsymbol{\theta}^{(t)})$, yields the Fisher scoring approach. Certain other (possibly scaled)
choices for $\mathbf{M}^{(t)}$ can also yield good performance while limiting computing effort.


2.2.2.1 Ascent Algorithms To force uphill steps, one could resort to an ascent 
algorithm. (Another type of ascent algorithm is discussed in Chapter 3.) In the present 
context, the method of steepest ascent is obtained with the Hessian replacement 
$\mathbf{M}^{(t)} = -\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Since the gradient of $g$ indicates the steepest direction uphill on the surface of $g$ at the point $\mathbf{x}^{(t)}$, setting $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + g'(\mathbf{x}^{(t)})$
amounts to taking a step in the direction of steepest ascent. Scaled steps of the form
$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \alpha^{(t)} g'(\mathbf{x}^{(t)})$ for some $\alpha^{(t)} > 0$ can be helpful for controlling convergence, as will be discussed below.
Many forms of M will yield ascent algorithms with increments 


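$$\mathbf{h}^{(t)} = -\alpha^{(t)} \left[\mathbf{M}^{(t)}\right]^{-1} g'(\mathbf{x}^{(t)}). \tag{2.44}$$

For any fixed $\mathbf{x}^{(t)}$ and negative definite $\mathbf{M}^{(t)}$, note that as $\alpha^{(t)} \to 0$ we have

$$\begin{aligned}
g(\mathbf{x}^{(t+1)}) - g(\mathbf{x}^{(t)}) &= g(\mathbf{x}^{(t)} + \mathbf{h}^{(t)}) - g(\mathbf{x}^{(t)}) \\
&= -\alpha^{(t)} g'(\mathbf{x}^{(t)})^T (\mathbf{M}^{(t)})^{-1} g'(\mathbf{x}^{(t)}) + o(\alpha^{(t)}),
\end{aligned} \tag{2.45}$$

where the second equality follows from the linear Taylor expansion $g(\mathbf{x}^{(t)} + \mathbf{h}^{(t)}) = g(\mathbf{x}^{(t)}) + g'(\mathbf{x}^{(t)})^T \mathbf{h}^{(t)} + o(\alpha^{(t)})$. Therefore, if $-\mathbf{M}^{(t)}$ is positive definite, ascent can
be assured by choosing $\alpha^{(t)}$ sufficiently small, yielding $g(\mathbf{x}^{(t+1)}) - g(\mathbf{x}^{(t)}) > 0$ from
(2.45) since $o(\alpha^{(t)})/\alpha^{(t)} \to 0$ as $\alpha^{(t)} \to 0$.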

Typically, therefore, an ascent algorithm involves a positive definite matrix 
$-\mathbf{M}^{(t)}$ to approximate the negative Hessian, and a contraction or step length parameter
$\alpha^{(t)} > 0$ whose value can shrink to ensure ascent at each step. For example, start each
step with $\alpha^{(t)} = 1$. If the original step turns out to be downhill, $\alpha^{(t)}$ can be halved.
This is called backtracking. If the step is still downhill, $\alpha^{(t)}$ is halved again until
a sufficiently small step is found to be uphill. For Fisher scoring, $-\mathbf{M}^{(t)} = I(\boldsymbol{\theta}^{(t)})$,
which is positive semidefinite. Therefore backtracking with Fisher scoring would
avoid stepping downhill. 
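Backtracking by step halving can be sketched with steepest ascent, the simplest choice of Hessian replacement. The objective below is a hypothetical concave function with its maximum at the origin, not the bivariate function of Example 2.4; the tolerance, the cap on halvings, and the function names are also assumptions.

```python
import math


def g(x):
    """Illustrative concave objective; its maximum is at (0, 0)."""
    return -math.exp(x[0]) + x[0] - (x[1] - x[0]) ** 2


def grad(x):
    """Gradient of the illustrative objective."""
    return [-math.exp(x[0]) + 1.0 + 2.0 * (x[1] - x[0]), -2.0 * (x[1] - x[0])]


def steepest_ascent(x0, tol=1e-5, max_iter=5000, max_halvings=40):
    """Steepest ascent (M = -I) with backtracking by step halving."""
    x = list(x0)
    for _ in range(max_iter):
        d = grad(x)  # with M = -I the step direction is the gradient
        if math.hypot(d[0], d[1]) < tol:
            break
        alpha = 1.0  # every step starts from the full step length
        for _halving in range(max_halvings):
            candidate = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
            if g(candidate) > g(x):
                break  # uphill step found
            alpha /= 2.0
        else:
            break  # no uphill step was found; give up
        x = candidate
    return x


x_sa = steepest_ascent([1.0, -1.0])
```

The acceptance test here only requires some increase in $g$; stricter sufficient-ascent conditions would replace the comparison `g(candidate) > g(x)`.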


Example 2.6 (Bivariate Optimization Problem, Continued) Figure 2.8 illustrates
an application of the steepest ascent algorithm to maximize the bivariate function
discussed in Example 2.4, starting from $\mathbf{x}^{(0)}$ and initialized with $\alpha^{(t)} = \tfrac{1}{4}$ at each
step. The steps taken by steepest ascent are shown by the solid line. Although the
optimization was successful, it was not fast or efficient. The dashed line illustrates
another method, discussed in Section 2.2.2.3.

FIGURE 2.8 Applications of two optimization methods for maximizing a complex bivariate
function. The surface of the function is indicated by shading and contours, with light shading
corresponding to high values. The two methods start at a point $\mathbf{x}^{(0)}$ and find the true maximum,
$\mathbf{x}^*$. The solid line corresponds to the method of steepest ascent (Example 2.6). The dashed
line corresponds to a quasi-Newton method with the BFGS update (Example 2.7). Both algorithms employed backtracking, with the initial value of $\alpha^{(t)}$ at each step set to 0.25 and 0.05,
respectively.


Step halving is only one approach to backtracking. In general, methods that rely
on finding an advantageous step length in the chosen direction are called line search
methods. Backtracking with a positive definite replacement for the negative Hessian
is not sufficient to ensure convergence of the algorithm, however, even when $g$ is
bounded above with a unique maximum. It is also necessary to ensure that steps make
a sufficient ascent (i.e., require that $g(\mathbf{x}^{(t)}) - g(\mathbf{x}^{(t-1)})$ does not decrease too quickly
as $t$ increases) and that step directions are not nearly orthogonal to the gradient (i.e.,
avoid following a level contour of $g$). Formal versions of such requirements include
the Goldstein–Armijo and Wolfe–Powell conditions, under which convergence of
ascent algorithms is guaranteed [14, 270, 514, 669].

When the step direction is not uphill, approaches known as modified Newton 
methods alter the direction sufficiently to find an uphill direction [247]. A quite effec- 
tive variant is the modified Cholesky decomposition approach [246]. In essence, when 
the negative Hessian is not positive definite, this strategy replaces it with $-\tilde{g}''(\mathbf{x}^{(t)}) = -g''(\mathbf{x}^{(t)}) + \mathbf{E}$, where $\mathbf{E}$ is a diagonal matrix with nonnegative elements. By crafting $\mathbf{E}$
carefully to ensure that $-\tilde{g}''(\mathbf{x}^{(t)})$ is positive definite without deviating unnecessarily
from the original direction −g’(x), a suitable uphill direction can be derived.


2.2.2.2 Discrete Newton and Fixed-Point Methods To avoid calculating
the Hessian, one could resort to a secant-like method, yielding a discrete Newton 
method, or rely solely on an initial approximation, yielding a multivariate fixed-point 
method. 

Multivariate fixed-point methods use an initial approximation of $g''$ throughout
the iterative updating. If this approximation is a matrix of constants, so $\mathbf{M}^{(t)} = \mathbf{M}$ for
all $t$, then the updating equation is

$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} - \mathbf{M}^{-1} g'(\mathbf{x}^{(t)}). \tag{2.46}$$

A reasonable choice for $\mathbf{M}$ is $g''(\mathbf{x}^{(0)})$. Notice that if $\mathbf{M}$ is diagonal, then this amounts
to applying the univariate scaled fixed-point algorithm separately to each component 
of g. See Section 2.1.4 for more on the relationship between fixed-point iteration and 
Newton’s method when maximizing log likelihoods that are locally quadratic. 

Multivariate discrete Newton methods approximate the matrix $g''(\mathbf{x}^{(t)})$ with a
matrix $\mathbf{M}^{(t)}$ of finite-difference quotients. Let $g'_i(\mathbf{x}) = dg(\mathbf{x})/dx_i$ be the $i$th element
of $g'(\mathbf{x})$. Let $\mathbf{e}_j$ denote the $p$-vector with a 1 in the $j$th position and zeros elsewhere.
Among the ways one might approximate the $(i, j)$th element of the Hessian using
discrete differences, perhaps the most straightforward is to set the $(i, j)$th element of
$\mathbf{M}^{(t)}$ to equal

$$M_{ij}^{(t)} = \frac{g'_i(\mathbf{x}^{(t)} + h_{ij}^{(t)} \mathbf{e}_j) - g'_i(\mathbf{x}^{(t)})}{h_{ij}^{(t)}} \tag{2.47}$$

for some constants $h_{ij}^{(t)}$. It is easiest to use $h_{ij}^{(t)} = h$ for all $(i, j)$ and $t$, but this leads
to a convergence order of $\beta = 1$. Alternatively, we can generally obtain an order of
convergence similar to that of the univariate secant method if we set $h_{ij}^{(t)} = x_j^{(t)} - x_j^{(t-1)}$ for all $i$, where $x_j^{(t)}$ denotes the $j$th element of $\mathbf{x}^{(t)}$. It is important to average
$\mathbf{M}^{(t)}$ with its transpose to ensure symmetry before proceeding with the update of $\mathbf{x}^{(t)}$
given in (2.43).


2.2.2.3 Quasi-Newton Methods The discrete Newton method strategy for
numerically approximating the Hessian by $\mathbf{M}^{(t)}$ is a computationally burdensome
one. At each step, $\mathbf{M}^{(t)}$ is wholly updated by calculating a new discrete difference for
each element. A more efficient approach can be designed, based on the direction of
the most recent step. When $\mathbf{x}^{(t)}$ is updated to $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \mathbf{h}^{(t)}$, the opportunity is
presented to learn about the curvature of $g$ in the direction of $\mathbf{h}^{(t)}$ near $\mathbf{x}^{(t)}$. Then $\mathbf{M}^{(t)}$
can be efficiently updated to incorporate this information.

To do this, we must abandon the componentwise discrete-difference approximation to $g''$ used in the discrete Newton method. However, it is possible to retain a
type of secant condition based on differences. Specifically, a secant condition holds
for $\mathbf{M}^{(t+1)}$ if

$$g'(\mathbf{x}^{(t+1)}) - g'(\mathbf{x}^{(t)}) = \mathbf{M}^{(t+1)} (\mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}). \tag{2.48}$$

This condition suggests that we need a method to generate $\mathbf{M}^{(t+1)}$ from $\mathbf{M}^{(t)}$ in a
manner that requires few calculations and satisfies (2.48). This will enable us to gain
information about the curvature of $g$ in the direction of the most recent step. The
result is a quasi-Newton method, sometimes called a variable metric approach [153,
247, 486].

There is a unique symmetric rank-one method that meets these requirements 
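[134]. Let $\mathbf{z}^{(t)} = \mathbf{x}^{(t+1)} - \mathbf{x}^{(t)}$ and $\mathbf{y}^{(t)} = g'(\mathbf{x}^{(t+1)}) - g'(\mathbf{x}^{(t)})$. Then we can write the
update to $\mathbf{M}^{(t)}$ as

$$\mathbf{M}^{(t+1)} = \mathbf{M}^{(t)} + c^{(t)} \mathbf{v}^{(t)} (\mathbf{v}^{(t)})^T, \tag{2.49}$$

where $\mathbf{v}^{(t)} = \mathbf{y}^{(t)} - \mathbf{M}^{(t)} \mathbf{z}^{(t)}$ and $c^{(t)} = 1/[(\mathbf{v}^{(t)})^T \mathbf{z}^{(t)}]$.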

It is important to monitor the behavior of this update to $\mathbf{M}^{(t)}$. If $c^{(t)}$ cannot
reliably be calculated because the denominator is zero or close to zero, a temporary
solution is to take $\mathbf{M}^{(t+1)} = \mathbf{M}^{(t)}$ for that iteration. We may also wish to backtrack to
ensure ascent. If $-\mathbf{M}^{(t)}$ is positive definite and $c^{(t)} \le 0$, then $-\mathbf{M}^{(t+1)}$ will be positive
definite. We use the term hereditary positive definiteness to refer to the desirable
situation when positive definiteness is guaranteed to be transferred from one iteration
to the next. If $c^{(t)} > 0$, then it may be necessary to backtrack by shrinking $c^{(t)}$ toward
zero until positive definiteness is achieved. Thus, positive definiteness is not hereditary
with this update. Monitoring and backtracking techniques and method performance 
are further explored in [375, 409]. 

There are several symmetric rank-two methods for updating a Hessian approx- 
imation while retaining the secant condition. The Broyden class [78, 80] of rank-two 
updates to the Hessian approximation has the form 


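$$\mathbf{M}^{(t+1)} = \mathbf{M}^{(t)} - \frac{\mathbf{M}^{(t)} \mathbf{z}^{(t)} (\mathbf{M}^{(t)} \mathbf{z}^{(t)})^T}{(\mathbf{z}^{(t)})^T \mathbf{M}^{(t)} \mathbf{z}^{(t)}} + \frac{\mathbf{y}^{(t)} (\mathbf{y}^{(t)})^T}{(\mathbf{z}^{(t)})^T \mathbf{y}^{(t)}} + \delta^{(t)} \left((\mathbf{z}^{(t)})^T \mathbf{M}^{(t)} \mathbf{z}^{(t)}\right) \mathbf{d}^{(t)} (\mathbf{d}^{(t)})^T, \tag{2.50}$$

where

$$\mathbf{d}^{(t)} = \frac{\mathbf{y}^{(t)}}{(\mathbf{z}^{(t)})^T \mathbf{y}^{(t)}} - \frac{\mathbf{M}^{(t)} \mathbf{z}^{(t)}}{(\mathbf{z}^{(t)})^T \mathbf{M}^{(t)} \mathbf{z}^{(t)}}.$$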

The most popular member of this class is the BFGS update [79, 197, 269, 588], which
simply sets $\delta^{(t)} = 0$. An alternative, for which $\delta^{(t)} = 1$, has also been extensively
studied [134, 199]. However, the BFGS update is generally accepted as superior to 
this, based on extensive empirical and theoretical studies. The rank-one update in 
(2.49) has also been shown to perform well and to be an attractive alternative to 
BFGS [120, 375]. 

The BFGS update, and indeed all members of the Broyden class, confer hereditary positive definiteness on $-\mathbf{M}^{(t)}$. Therefore, backtracking can ensure ascent. However, recall that guaranteed ascent is not equivalent to guaranteed convergence. The
order of convergence of quasi-Newton methods is usually faster than linear but slower
than quadratic. The loss of quadratic convergence (compared to Newton’s method)
is attributable to the replacement of the Hessian by an approximation. Nevertheless,
quasi-Newton methods are fast and powerful, and are among the most frequently used
methods in popular software packages. Several authors suggest that the performance
of (2.49) is superior to that of BFGS [120, 409].


Example 2.7 (Bivariate Optimization Problem, Continued) Figure 2.8 illustrates 
an application of quasi-Newton optimization with the BFGS update and backtracking 
for maximizing the bivariate function introduced in Example 2.4, starting from $\mathbf{x}^{(0)}$
and initialized with $\alpha^{(t)} = 0.05$ at each step. The steps taken in this example are
shown by the dashed line. The optimization successfully (and quickly) found x*. 
Recall that the solid line in this figure illustrates the steepest ascent method discussed 
in Section 2.2.2.1. Both quasi-Newton methods and steepest ascent require only first 
derivatives, and backtracking was used for both. The additional computation required 
by quasi-Newton approaches is almost always outweighed by its superior convergence 
performance, as was seen in this example. 
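A quasi-Newton iteration with the BFGS update and backtracking can be sketched as follows. The objective is a hypothetical concave function with its maximum at the origin, standing in for the bivariate function of the example, and the starting matrix $\mathbf{M}^{(0)} = -\mathbf{I}$, the initial step length of 1, the tolerances, and the helper names are all assumptions. The update implemented is the Broyden-class form with $\delta^{(t)} = 0$, written for a Hessian approximation $\mathbf{M}$ rather than its inverse.

```python
import math


def g(x):
    """Illustrative concave objective; its maximum is at (0, 0)."""
    return -math.exp(x[0]) + x[0] - (x[1] - x[0]) ** 2


def grad(x):
    """Gradient of the illustrative objective."""
    return [-math.exp(x[0]) + 1.0 + 2.0 * (x[1] - x[0]), -2.0 * (x[1] - x[0])]


def solve2(A, b):
    """Solve the 2x2 linear system A h = b by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def bfgs_update(M, z, y):
    """BFGS update: M - (Mz)(Mz)'/(z'Mz) + yy'/(z'y)."""
    Mz = [M[0][0] * z[0] + M[0][1] * z[1], M[1][0] * z[0] + M[1][1] * z[1]]
    zMz = z[0] * Mz[0] + z[1] * Mz[1]
    zy = z[0] * y[0] + z[1] * y[1]
    if abs(zMz) < 1e-14 or abs(zy) < 1e-14:
        return M  # skip the update when a denominator is unreliable
    return [[M[i][j] - Mz[i] * Mz[j] / zMz + y[i] * y[j] / zy
             for j in range(2)] for i in range(2)]


def quasi_newton(x0, tol=1e-5, max_iter=200, max_halvings=40):
    """Quasi-Newton ascent with BFGS updates and step halving."""
    x, M = list(x0), [[-1.0, 0.0], [0.0, -1.0]]  # M^(0) = -I
    for _ in range(max_iter):
        gx = grad(x)
        if math.hypot(gx[0], gx[1]) < tol:
            break
        full = solve2(M, [-gx[0], -gx[1]])  # full step -M^{-1} g'(x)
        alpha = 1.0
        for _halving in range(max_halvings):
            x_new = [x[0] + alpha * full[0], x[1] + alpha * full[1]]
            if g(x_new) > g(x):
                break
            alpha /= 2.0
        else:
            break  # no uphill step was found; give up
        z = [x_new[0] - x[0], x_new[1] - x[1]]
        g_new = grad(x_new)
        y = [g_new[0] - gx[0], g_new[1] - gx[1]]
        M = bfgs_update(M, z, y)
        x = x_new
    return x


x_qn = quasi_newton([1.0, -1.0])
```

Only gradient evaluations are needed, and the curvature information accumulated in $\mathbf{M}$ is what distinguishes the run from steepest ascent on the same problem.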


There has been a wide variety of research on methods to enhance the perfor- 
mance and stability of quasi-Newton methods. Perhaps the most important of these 
improvements involves the calculation of the update for $\mathbf{M}^{(t)}$. Although (2.50) provides a relatively straightforward update equation, its direct application is frequently
less numerically stable than alternatives. It is far better to update a Cholesky decomposition of $\mathbf{M}^{(t)}$ as described in [245].

The performance of quasi-Newton methods can be extremely sensitive to the
choice of the starting matrix $\mathbf{M}^{(0)}$. The easiest choice is the negative identity matrix,
but this is often inadequate when the scales of the components of $\mathbf{x}$ differ greatly.
In MLE problems, setting $\mathbf{M}^{(0)} = -I(\boldsymbol{\theta}^{(0)})$ is a much better choice, if calculation of
the expected Fisher information is possible. In any case, it is important to rescale
any quasi-Newton optimization problem so that the elements of $\mathbf{x}$ are on comparable
scales. This should improve performance and prevent the stopping criterion from
effectively depending only on those variables whose units are largest. Frequently, in
poorly scaled problems, one may find that a quasi-Newton algorithm will appear to
converge to a point for which some $x_i^{(t)}$ differ from the corresponding elements of the
starting point but others remain unchanged.

In the context of MLE and statistical inference, the Hessian is critical because 
it provides estimates of standard error and covariance. Yet, quasi-Newton methods 
rely on the notion that the root-finding problem can be solved efficiently even using poor approximations to the Hessian. Further, if stopped at iteration $t$, the most
recent Hessian approximation $\mathbf{M}^{(t-1)}$ is out of date and mislocated at $\boldsymbol{\theta}^{(t-1)}$ instead
of at $\boldsymbol{\theta}^{(t)}$. For all these reasons, the approximation may be quite bad. It is worth
the extra effort, therefore, to compute a more precise approximation after iterations
have stopped. Details are given in [153]. One approach is to rely on the central difference approximation, whose $(i, j)$th element is

$$\widehat{l''}(\boldsymbol{\theta}^{(t)})_{ij} = \frac{l'_i(\boldsymbol{\theta}^{(t)} + h_{ij} \mathbf{e}_j) - l'_i(\boldsymbol{\theta}^{(t)} - h_{ij} \mathbf{e}_j)}{2 h_{ij}}, \tag{2.51}$$

where $l'_i(\boldsymbol{\theta})$ is the $i$th component of the score function evaluated at $\boldsymbol{\theta}$. In this case,
decreasing $h_{ij}$ is associated with reduced discretization error but potentially increased
computer roundoff error. One rule of thumb in this case is to take $h_{ij} = h = \varepsilon^{1/3}$ for
all $i$ and $j$, where $\varepsilon$ represents the computer’s floating-point precision [535].
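The central difference approximation is short to code. In the sketch below, `score` is a hypothetical stand-in for a score function (the gradient of an illustrative objective, not a likelihood from the text), and the default $h = \varepsilon^{1/3}$ uses the machine epsilon reported by Python as the floating-point precision, which is an assumption about how $\varepsilon$ is to be interpreted.

```python
import math
import sys


def score(theta):
    """Gradient of an illustrative objective, standing in for a score function."""
    t1, t2 = theta
    return [-math.exp(t1) + 1.0 + 2.0 * (t2 - t1), -2.0 * (t2 - t1)]


def central_difference_hessian(score_fn, theta, h=None):
    """Approximate the Hessian by central differences of the score."""
    if h is None:
        h = sys.float_info.epsilon ** (1.0 / 3.0)  # rule-of-thumb step size
    p = len(theta)
    H = [[0.0] * p for _ in range(p)]
    for j in range(p):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        s_up, s_down = score_fn(up), score_fn(down)
        for i in range(p):
            H[i][j] = (s_up[i] - s_down[i]) / (2.0 * h)
    return H


H_hat = central_difference_hessian(score, [0.0, 0.0])
```

For this illustrative function the exact Hessian at the origin has rows $(-3, 2)$ and $(2, -2)$, so the quality of the approximation is easy to inspect.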


2.2.3 Gauss—Newton Method 


For MLE problems, we have seen how Newton’s method approximates the log likelihood function at $\boldsymbol{\theta}^{(t)}$ by a quadratic, and then maximizes this quadratic to obtain
the update $\boldsymbol{\theta}^{(t+1)}$. An alternative approach can be taken in nonlinear least squares
problems with observed data $(y_i, \mathbf{z}_i)$ for $i = 1, \ldots, n$, where one seeks to estimate $\boldsymbol{\theta}$
by maximizing an objective function $g(\boldsymbol{\theta}) = -\sum_{i=1}^{n} (y_i - f(\mathbf{z}_i, \boldsymbol{\theta}))^2$. Such objective
functions might be sensibly used, for example, when estimating $\boldsymbol{\theta}$ to fit the model

$$Y_i = f(\mathbf{z}_i, \boldsymbol{\theta}) + \epsilon_i \tag{2.52}$$

for some nonlinear function $f$ and random error $\epsilon_i$.

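Rather than approximate $g$, the Gauss–Newton approach approximates $f$ itself
by its linear Taylor series expansion about $\boldsymbol{\theta}^{(t)}$. Replacing $f$ by its linear approximation
yields a linear least squares problem, which can be solved to derive an update $\boldsymbol{\theta}^{(t+1)}$.

Specifically, the nonlinear model in (2.52) can be approximated by

$$Y_i \approx f(\mathbf{z}_i, \boldsymbol{\theta}^{(t)}) + (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^T f'(\mathbf{z}_i, \boldsymbol{\theta}^{(t)}) + \epsilon_i = \tilde{f}(\mathbf{z}_i, \boldsymbol{\theta}^{(t)}, \boldsymbol{\theta}) + \epsilon_i, \tag{2.53}$$

where for each $i$, $f'(\mathbf{z}_i, \boldsymbol{\theta}^{(t)})$ is the column vector of partial derivatives of $f(\mathbf{z}_i, \boldsymbol{\theta}^{(t)})$
with respect to $\theta_j$, for $j = 1, \ldots, p$, evaluated at $(\mathbf{z}_i, \boldsymbol{\theta}^{(t)})$. A Gauss–Newton step is
derived from the maximization of $\tilde{g}(\boldsymbol{\theta}) = -\sum_{i=1}^{n} \left[y_i - \tilde{f}(\mathbf{z}_i, \boldsymbol{\theta}^{(t)}, \boldsymbol{\theta})\right]^2$ with respect
to $\boldsymbol{\theta}$, whereas a Newton step is derived from the maximization of a quadratic approximation to $g$ itself, namely $g(\boldsymbol{\theta}^{(t)}) + (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^T g'(\boldsymbol{\theta}^{(t)}) + \tfrac{1}{2} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})^T g''(\boldsymbol{\theta}^{(t)}) (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)})$.

Let $X_i^{(t)}$ denote a working response whose observed value is $x_i^{(t)} = y_i - f(\mathbf{z}_i, \boldsymbol{\theta}^{(t)})$, and define $\mathbf{a}_i^{(t)} = f'(\mathbf{z}_i, \boldsymbol{\theta}^{(t)})$. Then the approximated problem can be re-expressed as minimizing the squared residuals of the linear regression model

$$\mathbf{X}^{(t)} = \mathbf{A}^{(t)} (\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) + \boldsymbol{\epsilon}, \tag{2.54}$$

where $\mathbf{X}^{(t)}$ and $\boldsymbol{\epsilon}$ are column vectors whose $i$th elements consist of $X_i^{(t)}$ and $\epsilon_i$,
respectively. Similarly, $\mathbf{A}^{(t)}$ is a matrix whose $i$th row is $(\mathbf{a}_i^{(t)})^T$.

The minimal squared error for fitting (2.54) is achieved when

$$(\boldsymbol{\theta} - \boldsymbol{\theta}^{(t)}) = \left((\mathbf{A}^{(t)})^T \mathbf{A}^{(t)}\right)^{-1} (\mathbf{A}^{(t)})^T \mathbf{x}^{(t)}. \tag{2.55}$$

Thus, the Gauss–Newton update for $\boldsymbol{\theta}^{(t)}$ is

$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \left((\mathbf{A}^{(t)})^T \mathbf{A}^{(t)}\right)^{-1} (\mathbf{A}^{(t)})^T \mathbf{x}^{(t)}. \tag{2.56}$$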

FIGURE 2.9 Simplex superimposed over contours of the objective function $g$ for $p = 2$. The
best vertex is near the optimum of $g$. The best face is the triangle side containing $\mathbf{c}$, which is
its centroid.

Compared to Newton’s method, the potential advantage of the Gauss–Newton
method is that it does not require computation of the Hessian. It is fast when $f$ is
nearly linear or when the model fits well. In other situations, particularly when the
residuals at the true solution are large because the model fits poorly, the method may
converge very slowly or not at all, even from good starting values. A variant of the
Gauss–Newton method has better convergence behavior in such situations [152].
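A Gauss–Newton iteration can be sketched for a two-parameter model. The exponential model $f(z, \boldsymbol{\theta}) = \theta_1 \exp(\theta_2 z)$, the noise-free data generated from $\boldsymbol{\theta} = (2, -1)$, the starting value, and the function name are all assumptions chosen so that the model fits well, which is the favorable case described above.

```python
import math


def gauss_newton(z, y, theta, tol=1e-10, max_iter=50):
    """Gauss-Newton for the illustrative model f(z, theta) = theta1 * exp(theta2 * z)."""
    t1, t2 = theta
    for _ in range(max_iter):
        e = [math.exp(t2 * zi) for zi in z]
        x = [yi - t1 * ei for yi, ei in zip(y, e)]   # working response y - f
        a1 = e                                       # df/dtheta1
        a2 = [t1 * zi * ei for zi, ei in zip(z, e)]  # df/dtheta2
        # Normal equations (A'A) d = A'x for the 2-parameter case
        s11 = sum(u * u for u in a1)
        s12 = sum(u * v for u, v in zip(a1, a2))
        s22 = sum(v * v for v in a2)
        r1 = sum(u * xi for u, xi in zip(a1, x))
        r2 = sum(v * xi for v, xi in zip(a2, x))
        det = s11 * s22 - s12 * s12
        d1 = (s22 * r1 - s12 * r2) / det
        d2 = (s11 * r2 - s12 * r1) / det
        t1, t2 = t1 + d1, t2 + d2
        if abs(d1) + abs(d2) < tol:
            break
    return t1, t2


z_obs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y_obs = [2.0 * math.exp(-zi) for zi in z_obs]  # noise-free data from theta = (2, -1)
theta_hat = gauss_newton(z_obs, y_obs, (1.5, -0.5))
```

Each iteration is an ordinary linear least squares fit of the current residuals on the current derivative matrix, so no second derivatives of $f$ are needed.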


2.2.4 Nelder—Mead Algorithm 


The algorithms described thus far all rely on derivatives or approximations thereof. In 
many cases, derivation of $g'$ and $g''$ is undesirable or infeasible. The Nelder–Mead algorithm is one of a class of optimization methods that require no derivative information
[482, 650]. It is an iterative direct search approach because it depends only on the ranks
of a collection of function evaluations at possible solutions while it tries to nominate 
a superior point for the next iteration [369, 385, 515, 675]. Chapter 3 describes some 
other direct search methods including genetic algorithms and simulated annealing. 
The tth iteration of the Nelder—Mead algorithm begins with a collection of points 
representing possible solutions, that is, approximations to the maximum. These points 
define a neighborhood—specifically a simplex—near which search effort is currently 
focused. An iteration of the algorithm seeks to reshape and resize the neighborhood 
through the nomination of a new point to replace the worst point in the collection. Ide- 
ally, the candidate may be much better than some other points, or even the best so far. 
When $x$ is $p$-dimensional, $p + 1$ distinct points $x_1, \ldots, x_{p+1}$ define a $p$-dimensional simplex, namely the convex hull of vertices $x_1, \ldots, x_{p+1}$. A simplex is a triangle or a tetrahedron (i.e., a pyramid with triangular base) when $p = 2$ and $p = 3$, respectively. The vertices of the simplex can be ranked from best to worst according to the ranks of $g(x_1), \ldots, g(x_{p+1})$; see Figure 2.9. When seeking a maximum, let $x_{\text{best}}$ correspond to the vertex with the highest objective function value and $x_{\text{worst}}$ to the lowest. Denote the second-worst vertex as $x_{\text{bad}}$. Ties can be addressed as by [397, 482]. We can also define the best face to be the face opposite $x_{\text{worst}}$. The best face is therefore the hyperplane containing the other points, and its centroid is the mean of all the other vertices, namely $c = \frac{1}{p}\left[\left(\sum_{i=1}^{p+1} x_i\right) - x_{\text{worst}}\right]$.
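As a small illustration of the ranking and centroid computation, the following Python sketch orders the vertices of a simplex and averages the vertices of the best face. The function name `rank_simplex`, the objective `g_demo`, and the example vertices are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def rank_simplex(vertices, g):
    """Return (x_best, x_bad, x_worst, c) for a simplex when maximizing g."""
    ordered = sorted(vertices, key=g, reverse=True)   # best first, worst last
    best, bad, worst = ordered[0], ordered[-2], ordered[-1]
    p = len(worst)
    # Centroid of the best face: mean of all vertices except the worst.
    c = [sum(v[k] for v in ordered[:-1]) / p for k in range(p)]
    return best, bad, worst, c

def g_demo(x):
    # Hypothetical concave objective with its maximum at the origin.
    return -(x[0] ** 2 + x[1] ** 2)

best, bad, worst, c = rank_simplex([[3.0, 0.0], [0.0, 4.0], [1.0, 1.0]], g_demo)
```

Here the objective values are -9, -16, and -2, so the vertex (1, 1) is best, (3, 0) is second-worst, (0, 4) is worst, and the centroid of the best face is (2, 0.5).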
FIGURE 2.10 The five possible transformations of a simplex. The unshaded triangle is the current simplex, the hashed triangle shows the simplex that would be obtained by reflection, and the gray triangle represents the simplex adopted at the completion of the respective transformations. Points labeled w, b, and r are the worst, best, and reflection vertices, respectively, and c is the centroid of the best face.


Having determined the best, second-worst, and worst vertices, we try to replace the worst vertex with a better one. The algorithm requires that the new point lie upon the ray extending from $x_{\text{worst}}$ through $c$, which we call the search direction. In this sense, the new vertex location will be moved in the direction of better alternatives away from the worst vertex. Further, selecting a replacement for $x_{\text{worst}}$ will change the shape and size of the simplex. Although this search direction may be promising, the quality of the new vertex will also depend on the distance of the new vertex from $x_{\text{worst}}$. This distance affects simplex size. Indeed, the Nelder-Mead algorithm is sometimes referred to as the amoeba method, reflecting the flexible transformation and movement of the simplex as its size and shape change to adapt to the local hills and valleys of the objective function [517].

The location of the chosen new vertex is based upon the reflection vertex $x_r$, defined as $x_r = c + \alpha_r(c - x_{\text{worst}})$ as shown in Figure 2.10. Reflections require $\alpha_r > 0$, and usually $\alpha_r = 1$. Although $x_r$ itself may not be the new vertex, it, $c$, and $x_{\text{worst}}$ are used to determine the new point. The following paragraphs describe the assortment of ways by which the new vertex can be derived, and Figure 2.10 illustrates these methods and the resulting simplex transformations.

Consider first when $g(x_r)$ exceeds $g(x_{\text{bad}})$. If $g(x_r)$ does not also exceed the objective function value for $x_{\text{best}}$, then $x_r$ is accepted as a new vertex and $x_{\text{worst}}$ is discarded. The updated collection of vertices defines a new simplex (Figure 2.10), and a new iteration of the algorithm begins. However, if $g(x_r) > g(x_{\text{best}})$ so that the reflection vertex is better than the current best, then even greater improvement is sought by extending search further in the direction pursued by $x_r$. This leads to an attempt at expansion. If $x_r$ is no better than $x_{\text{bad}}$, then we try to mitigate this unfortunate outcome using a contraction of $x_r$.

An expansion occurs when $g(x_r)$ exceeds $g(x_{\text{best}})$. An expansion point $x_e$ is then defined as $x_e = c + \alpha_e(x_r - c)$, where $\alpha_e > \max(1, \alpha_r)$ and usually $\alpha_e = 2$. Thus $x_e$ is a point along the search direction vector beyond $x_r$. Since exploration in the direction leading to $x_r$ yielded such a good point, the hope is that even more improvement might be achieved by going further in that direction. If $g(x_e)$ exceeds $g(x_r)$, then expansion was successful so $x_e$ is accepted as a new vertex, $x_{\text{worst}}$ is discarded, and a new iteration is begun. If $g(x_e)$ fails to surpass $g(x_r)$, then $x_r$ is retained as the improved vertex, $x_{\text{worst}}$ is discarded, and a new iteration is begun.

Thus far we have described a process which leads to acceptance of a new vertex ($x_r$ or $x_e$) whenever the reflected point is better than $x_{\text{bad}}$. When $g(x_r)$ is no greater than $g(x_{\text{bad}})$, then additional search is needed because $x_r$ would be the worst vertex even though it replaced $x_{\text{worst}}$. The contraction strategy is to identify a final vertex somewhere along the search direction between $x_{\text{worst}}$ and $x_r$. When this vertex is between $c$ and $x_r$, the transformation is called an outer contraction; otherwise it is an inner contraction.

An outer contraction is conducted when $g(x_{\text{bad}}) \ge g(x_r) > g(x_{\text{worst}})$. The vertex obtained by an outer contraction is defined as $x_o = c + \alpha_c(x_r - c)$ where $0 < \alpha_c < 1$ and normally $\alpha_c = \frac{1}{2}$. If $g(x_o) \ge g(x_r)$ so that the outer contraction vertex is at least as good as the reflection vertex, then $x_o$ is chosen to replace $x_{\text{worst}}$. Otherwise, we are in a situation where $x_r$ would be the worst vertex after it replaced $x_{\text{worst}}$. In this case, instead of performing that pointless replacement, a shrink transformation is performed as described later.

An inner contraction is attempted when $g(x_r) \le g(x_{\text{worst}})$, that is, when $x_r$ is worse than all vertices of the current simplex. In this case an inner contraction point is defined as $x_i = c + \alpha_c(x_{\text{worst}} - c)$. If $g(x_i) > g(x_{\text{worst}})$, then $x_i$ is chosen to replace $x_{\text{worst}}$. Otherwise, no reasonable replacement for $x_{\text{worst}}$ has been identified. Again a shrink transformation is warranted.

When all else fails, the simplex is subjected to a shrink transformation. In this case, all vertices except the best are shrunk toward $x_{\text{best}}$ by transforming the $j$th vertex $x_j$ to $x_{sj}$ according to $x_{sj} = x_{\text{best}} + \alpha_s(x_j - x_{\text{best}})$ where $j$ indexes vertices for $j = 1, \ldots, p + 1$. Shrinking has no effect on $x_{\text{best}}$ so this calculation is omitted. Shrinking after a failed contraction will focus the simplex near the vertex with the greatest objective function value. In practice, shrinking happens very rarely. Shrinkage requires $0 < \alpha_s < 1$ and normally $\alpha_s = \frac{1}{2}$.
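To make the geometry of the five transformations concrete, the sketch below computes the reflection, expansion, outer contraction, and inner contraction points, plus the shrunken simplex, for one example triangle. The helper names `move`, `candidate_points`, and `shrink` and the example coordinates are hypothetical; the coefficients default to the usual values quoted above.

```python
def move(c, target, a):
    """Return the point c + a * (target - c)."""
    return [ck + a * (tk - ck) for ck, tk in zip(c, target)]

def candidate_points(c, worst, a_r=1.0, a_e=2.0, a_c=0.5):
    """Candidate replacements for x_worst along the search direction."""
    x_r = [ck + a_r * (ck - wk) for ck, wk in zip(c, worst)]   # reflection
    return {
        "reflection": x_r,
        "expansion": move(c, x_r, a_e),     # beyond x_r
        "outer": move(c, x_r, a_c),         # between c and x_r
        "inner": move(c, worst, a_c),       # between x_worst and c
    }

def shrink(vertices, best, a_s=0.5):
    """Shrink every vertex except the best toward x_best."""
    return [list(v) if list(v) == list(best) else move(best, v, a_s)
            for v in vertices]

# Example triangle: best (1, 1), second-worst (3, 0), worst (0, 4).
simplex = [[1.0, 1.0], [3.0, 0.0], [0.0, 4.0]]
c = [2.0, 0.5]                               # centroid of the best face
pts = candidate_points(c, simplex[2])
shrunk = shrink(simplex, simplex[0])
```

All four candidates lie on the line through the worst vertex and the centroid, so choosing among them only changes how far the new vertex sits from the worst one.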

In summary, the Nelder-Mead algorithm follows the following steps.

1. Initialize. For $t = 1$ choose starting vertices $x_1^{(t)}, \ldots, x_{p+1}^{(t)}$. Choose $\alpha_r > 0$, $\alpha_e > \max\{1, \alpha_r\}$, $0 < \alpha_c < 1$, and $0 < \alpha_s < 1$. Standard values for $(\alpha_r, \alpha_e, \alpha_c, \alpha_s)$ are $(1, 2, \frac{1}{2}, \frac{1}{2})$ [397, 482].

2. Sort. Among the current set of vertices, identify the ones that yield the highest, second lowest, and lowest objective function evaluations, namely $x_{\text{best}}^{(t)}$, $x_{\text{bad}}^{(t)}$, and $x_{\text{worst}}^{(t)}$, respectively.

3. Orient. Compute
   $$c^{(t)} = \frac{1}{p}\left[\left(\sum_{i=1}^{p+1} x_i^{(t)}\right) - x_{\text{worst}}^{(t)}\right].$$

4. Reflect. Compute $x_r^{(t)} = c^{(t)} + \alpha_r(c^{(t)} - x_{\text{worst}}^{(t)})$. Compare $g(x_r^{(t)})$ to $g(x_{\text{best}}^{(t)})$ and $g(x_{\text{bad}}^{(t)})$.
   a. If $g(x_{\text{best}}^{(t)}) \ge g(x_r^{(t)}) > g(x_{\text{bad}}^{(t)})$, then accept $x_r^{(t)}$ as a new vertex for iteration $t + 1$ and discard $x_{\text{worst}}^{(t)}$. Go to the stopping step.
   b. If $g(x_r^{(t)}) > g(x_{\text{best}}^{(t)})$, then continue to the expansion step.
   c. Otherwise skip ahead to the contraction step.

5. Expansion. Compute $x_e^{(t)} = c^{(t)} + \alpha_e(x_r^{(t)} - c^{(t)})$. Compare $g(x_e^{(t)})$ to $g(x_r^{(t)})$.
   a. If $g(x_e^{(t)}) > g(x_r^{(t)})$, then accept $x_e^{(t)}$ as a new vertex for iteration $t + 1$ and discard $x_{\text{worst}}^{(t)}$. Go to the stopping step.
   b. Otherwise, accept $x_r^{(t)}$ as a new vertex for iteration $t + 1$ and discard $x_{\text{worst}}^{(t)}$. Go to the stopping step.

6. Contraction. Compare $g(x_r^{(t)})$ to $g(x_{\text{bad}}^{(t)})$ and $g(x_{\text{worst}}^{(t)})$. If $g(x_{\text{bad}}^{(t)}) \ge g(x_r^{(t)}) > g(x_{\text{worst}}^{(t)})$, then perform an outer contraction. Otherwise [i.e., when $g(x_{\text{worst}}^{(t)}) \ge g(x_r^{(t)})$] perform an inner contraction.
   a. Outer contraction. Compute $x_o^{(t)} = c^{(t)} + \alpha_c(x_r^{(t)} - c^{(t)})$.
      i. If $g(x_o^{(t)}) \ge g(x_r^{(t)})$, then accept $x_o^{(t)}$ as a new vertex for iteration $t + 1$ and discard $x_{\text{worst}}^{(t)}$. Go to the stopping step.
      ii. Otherwise, go to the shrinking step.
   b. Inner contraction. Compute $x_i^{(t)} = c^{(t)} + \alpha_c(x_{\text{worst}}^{(t)} - c^{(t)})$.
      i. If $g(x_i^{(t)}) > g(x_{\text{worst}}^{(t)})$, then accept $x_i^{(t)}$ as a new vertex for iteration $t + 1$ and discard $x_{\text{worst}}^{(t)}$. Go to the stopping step.
      ii. Otherwise, go to the shrinking step.

7. Shrinking. For all $j = 1, \ldots, p + 1$ for which $x_j^{(t)} \ne x_{\text{best}}^{(t)}$ compute $x_{sj}^{(t)} = x_{\text{best}}^{(t)} + \alpha_s(x_j^{(t)} - x_{\text{best}}^{(t)})$. Collect $x_{\text{best}}^{(t)}$ and these $p$ new vertices to form the simplex for iteration $t + 1$. Go to the stopping step.

8. Stopping. Check convergence criteria. If stopping is not warranted, increment $t$ to $t + 1$ and begin a new iteration by returning to the sort step. Otherwise $x_{\text{best}}^{(t)}$ is taken to be the approximate maximizer of $g$.
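The steps above can be sketched compactly in Python. This is an illustrative reading of the summary, not a tuned implementation: the function name `nelder_mead_max`, the absolute tolerance `tol`, the iteration cap `max_iter`, and the test objective `g_demo` are assumptions, and the convergence check is done right after sorting rather than as a separate final step. For clarity the sketch re-evaluates `g` freely instead of caching values.

```python
def nelder_mead_max(g, simplex, a_r=1.0, a_e=2.0, a_c=0.5, a_s=0.5,
                    tol=1e-10, max_iter=1000):
    """Maximize g starting from a list of p + 1 vertices."""
    pts = [list(v) for v in simplex]
    p = len(pts) - 1

    def move(c, target, a):
        # Point c + a * (target - c).
        return [ck + a * (tk - ck) for ck, tk in zip(c, target)]

    for _ in range(max_iter):
        pts.sort(key=g, reverse=True)                  # Sort: best first
        best, bad, worst = pts[0], pts[-2], pts[-1]
        # Stopping: both the simplex and the spread of g values are tiny.
        size = max(abs(vk - bk) for v in pts[1:] for vk, bk in zip(v, best))
        if g(best) - g(worst) < tol and size < tol:
            break
        c = [sum(v[k] for v in pts[:-1]) / p for k in range(p)]    # Orient
        x_r = [ck + a_r * (ck - wk) for ck, wk in zip(c, worst)]   # Reflect
        g_r = g(x_r)
        if g(best) >= g_r > g(bad):
            new = x_r
        elif g_r > g(best):                            # Expansion
            x_e = move(c, x_r, a_e)
            new = x_e if g(x_e) > g_r else x_r
        elif g_r > g(worst):                           # Outer contraction
            x_o = move(c, x_r, a_c)
            new = x_o if g(x_o) >= g_r else None
        else:                                          # Inner contraction
            x_i = move(c, worst, a_c)
            new = x_i if g(x_i) > g(worst) else None
        if new is None:                                # Shrinking
            pts = [best] + [move(best, v, a_s) for v in pts[1:]]
        else:
            pts[-1] = new                              # replace x_worst
    return max(pts, key=g)

def g_demo(x):
    # Hypothetical concave objective with its maximum at (1, -2).
    return -(x[0] - 1.0) ** 2 - (x[1] + 2.0) ** 2

x_hat = nelder_mead_max(g_demo, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

On the first iteration of this example the reflection point (1, -1) beats the best vertex, so an expansion to (1.5, -2) is attempted and accepted.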


An easy choice for initialization is to form a starting simplex around an initial guess for the optimum, say $x_0$, by using $x_0$ as one vertex and choosing the remaining vertices along coordinate axes to make the simplex right-angled at $x_0$.
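A right-angled starting simplex of this kind takes only a few lines to build. In the sketch below the function name `right_angled_simplex` and the common edge length `step` are hypothetical; the text does not prescribe how far along each axis the remaining vertices should sit.

```python
def right_angled_simplex(x0, step=1.0):
    """Vertices x0 and x0 + step * e_k for each coordinate axis k."""
    vertices = [list(x0)]
    for k in range(len(x0)):
        v = list(x0)
        v[k] += step          # move along the kth coordinate axis only
        vertices.append(v)
    return vertices

start = right_angled_simplex([3.0, -0.5], step=0.25)
```

A per-coordinate step could be substituted when the parameters are on very different scales.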


Some variations to this standard algorithm have been explored. Small changes to the decision rules have been considered by [85, 448, 502, 518]. More radical alternatives, such as further constraints on the nomination or acceptance of new vertices, are explored by [85, 479, 518, 565, 636, 650]. Alternative choices for $(\alpha_r, \alpha_e, \alpha_c, \alpha_s)$ are mentioned by [24, 85, 502].
FIGURE 2.11 Initial steps of the Nelder-Mead algorithm for maximizing a complicated
bivariate function. The surface of the function is indicated by shading and contours, with light 
shading corresponding to high values. In the left panel, the algorithm is initiated with simplex 
1. An expansion step yields simplex 2. Continuing in the right panel, the next two steps are 
inner contractions, yielding simplices 3 and 4. 


Example 2.8 (Bivariate Optimization Problem, Continued) Let us examine an application of the standard Nelder-Mead algorithm to maximize the bivariate function discussed in Example 2.4. Figures 2.11 and 2.12 show the results. The starting simplex was defined by $x_{\text{worst}} = (3.5, -0.5)$, $x_{\text{bad}} = (3.25, -1.4)$, and $x_{\text{best}} = (3, -0.5)$. After initialization, the algorithm takes a reflection step, finding a new $x_{\text{best}}$ with much greater objective function value. However, this yields a poor search direction because reflection and outer contraction would both produce points far down the opposite side of the ridge just ascended. Instead, two inner contraction steps occur, at which point the best face has changed. The new search direction is now good, and the next two steps are an expansion and a reflection. The right panel of Figure 2.12 shows the remaining progress of the algorithm. Although convergence is slow, the correct maximum is eventually found.


Two types of convergence criteria are needed for the Nelder-Mead algorithm. First, some measure of the (relative) change in vertex locations should be examined. It is important to note, however, that a lack of change in $x_{\text{best}}$ alone will not provide complete information about search progress because $x_{\text{best}}$ may remain unchanged for several successive iterations. Further, it can be more effective to monitor the convergence of, for example, simplex volume rather than any particular point. Such a criterion corresponds to the relative convergence criterion we have considered previously in this chapter and, notably, is important when optimizing discontinuous functions. Second, one should determine whether the values of the objective function appear to have converged. Performance of variants of the Nelder-Mead algorithm can be improved using modified stopping rules [600, 636].
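One way to combine the two kinds of criteria is sketched below: the simplex volume, computed as $|\det V| / p!$ from the edge vectors leaving the first vertex, together with the spread of objective values across vertices. The function names and the absolute thresholds `vol_tol` and `g_tol` are assumptions for illustration; a relative criterion would scale these by the current magnitudes.

```python
import math

def simplex_volume(vertices):
    """Volume |det(V)| / p!, where V holds the edge vectors x_j - x_1."""
    x1 = vertices[0]
    p = len(x1)
    V = [[v[k] - x1[k] for k in range(p)] for v in vertices[1:]]
    det = 1.0
    for col in range(p):                      # Gaussian elimination
        piv = max(range(col, p), key=lambda r: abs(V[r][col]))
        if V[piv][col] == 0.0:
            return 0.0                        # degenerate (collapsed) simplex
        if piv != col:
            V[col], V[piv] = V[piv], V[col]
            det = -det
        det *= V[col][col]
        for r in range(col + 1, p):
            factor = V[r][col] / V[col][col]
            for k in range(col, p):
                V[r][k] -= factor * V[col][k]
    return abs(det) / math.factorial(p)

def converged(vertices, g, vol_tol=1e-12, g_tol=1e-10):
    """True when the simplex is tiny and objective values nearly agree."""
    values = [g(v) for v in vertices]
    return (simplex_volume(vertices) < vol_tol
            and max(values) - min(values) < g_tol)
```

Note that a collapsed simplex, as in the failure case discussed later, also has volume near zero, so a small volume alone does not certify that an optimum has been found.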

The Nelder-Mead method is generally quite good at finding optima, especially for low to moderate dimensions [397, 675]. For high-dimensional problems, its effectiveness is more varied, depending on the nature of the problem [85, 493, 600]. Theoretical analysis of algorithm convergence has been limited to restricted classes of functions or substantively modified versions of the standard algorithm [85, 368, 397, 518, 636].

FIGURE 2.12 Further steps of the Nelder-Mead algorithm for maximizing a complicated bivariate function. The surface of the function is indicated by shading and contours, with light shading corresponding to high values. Continuing on from Figure 2.11, the left panel shows that the next two Nelder-Mead steps are an expansion (simplex 5) and a reflection (simplex 6). In the right panel, further steps are shown as iterations home in on the maximum.

The Nelder—Mead approach is quite robust in the sense that it can successfully 
find optima for a wide range of functions—even discontinuous ones—and from a 
wide range of starting values [397, 493, 675]. Moreover, it is robust in the sense 
that convergence is often not much impeded when the objective function values are 
contaminated with random noise [24, 154]. 

Despite its good performance, the Nelder—Mead algorithm can perform poorly 
in certain circumstances. Surprisingly, it is even possible for the algorithm to converge
to points that are neither local maxima nor minima. The following example illustrates 
one such case. 


Example 2.9 (Nelder—Mead Failure) A failure of the Nelder-Mead method is 
illustrated by Figure 2.13 where the simplex collapses [448]. 

Consider the following bivariate objective function using $p = 2$ and $x = (x_1, x_2)$:

$$g(x_1, x_2) = \begin{cases} -360|x_1|^2 - x_2 - x_2^2 & \text{if } x_1 \le 0, \\ -6x_1^2 - x_2 - x_2^2 & \text{otherwise.} \end{cases} \tag{2.57}$$


Let the starting values be (0,0), (1, 1), and roughly (0.84, —0.59). Then for this 
surprisingly simple function, iterations produce simplices whose best vertex never 
changes despite it being far from any extremum and yet the simplex area converges 
to zero in the manner shown in Figure 2.13. As this happens, the search direction 
becomes orthogonal to g’ so that the improvement at successive iterations converges 
to zero. 
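The objective in this example is easy to code, which helps in seeing why the stagnation point is not an optimum. The sketch below reads (2.57) as $-360|x_1|^2 - x_2 - x_2^2$ for $x_1 \le 0$ and $-6x_1^2 - x_2 - x_2^2$ otherwise, and evaluates it at the three starting vertices; the variable names are hypothetical.

```python
def g(x):
    """Objective of Example 2.9, as read from (2.57)."""
    x1, x2 = x
    if x1 <= 0:
        return -360 * abs(x1) ** 2 - x2 - x2 ** 2
    return -6 * x1 ** 2 - x2 - x2 ** 2

start = [(0.0, 0.0), (1.0, 1.0), (0.84, -0.59)]
values = [g(v) for v in start]          # objective at the starting vertices

# Along x1 = 0 the function is -x2 - x2^2, which peaks at x2 = -1/2.
g_at_peak = g((0.0, -0.5))
```

The vertex (0, 0) has the largest value among the starting vertices, yet the function is larger at (0, -1/2), so a best vertex that stays at the origin is stuck at a non-optimal point.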


FIGURE 2.13 Contours of $g$ and successive simplices for Example 2.9. The solid dots indicate the $c^{(t)}$ locations, the hollow circle is $x_{\text{best}}^{(t)}$ for every $t$, and $x^*$ is the global maximum of $g$.

When the algorithm stagnates as in Example 2.9, restarting with a different simplex can often remedy the problem by setting the algorithm on a different and possibly more productive ascent path. Alternatively, the oriented restart approach is specifically designed to reshape the simplex in a manner targeting steepest ascent [368]. Define a $p \times p$ matrix of simplex directions as $V^{(t)} = (x_2^{(t)} - x_1^{(t)}, x_3^{(t)} - x_1^{(t)}, \ldots, x_{p+1}^{(t)} - x_1^{(t)})$ and a corresponding vector of objective function differences as $\delta^{(t)} = (g(x_2^{(t)}) - g(x_1^{(t)}), g(x_3^{(t)}) - g(x_1^{(t)}), \ldots, g(x_{p+1}^{(t)}) - g(x_1^{(t)}))$, where the $p + 1$ vertices are all ordered with respect to quality so that $x_1^{(t)} = x_{\text{best}}^{(t)}$. Then we may define the simplex gradient of simplex $S^{(t)}$ to be $D(S^{(t)}) = (V^{(t)})^{-T} \delta^{(t)}$. This simplex gradient is designed to approximate the true gradient of $g$ at $x_1^{(t)}$.

An oriented restart is triggered when the average vertex improvement is too small. Specifically, let

$$\bar{g}^{(t)} = \frac{1}{p+1} \sum_{i=1}^{p+1} g(x_i^{(t)}) \tag{2.58}$$

be the average objective function value of the vertices of $S^{(t)}$ at iteration $t$. Define a sufficient increase in simplex quality from iteration $t$ to $t + 1$ to be one for which

$$\bar{g}^{(t+1)} - \bar{g}^{(t)} > \epsilon \left\| D(S^{(t)}) \right\|^2 \tag{2.59}$$

where $\epsilon$ is chosen to be a small number (e.g., 0.0001).
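The simplex gradient and the sufficient-increase test can be sketched as follows. The sketch assumes the simplex gradient $D$ is the solution of $V^T D = \delta$, with the vertex list ordered so that its first entry is the best vertex; the helper `solve_linear`, the function names, and the linear test objective are hypothetical.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting (helper)."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def simplex_gradient(vertices, g):
    """Solve V^T D = delta, where vertices[0] plays the role of x_1."""
    x1 = vertices[0]
    g1 = g(x1)
    Vt = [[vk - xk for vk, xk in zip(v, x1)] for v in vertices[1:]]  # rows of V^T
    delta = [g(v) - g1 for v in vertices[1:]]
    return solve_linear(Vt, delta)

def sufficient_increase(gbar_old, gbar_new, D, eps=1e-4):
    """Check gbar_new - gbar_old > eps * ||D||^2, as in (2.59)."""
    return gbar_new - gbar_old > eps * sum(d * d for d in D)

def g_lin(x):
    # Hypothetical linear objective; its true gradient is (3, -2).
    return 3.0 * x[0] - 2.0 * x[1] + 1.0

D = simplex_gradient([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]], g_lin)
```

For a linear objective the simplex gradient reproduces the true gradient exactly, which makes this a convenient sanity check before applying the idea to a curved surface.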
When the increase is not sufficient, the oriented restart consists of replacing all vertices except $x_{\text{best}}^{(t)}$ with vertices situated on coordinate axes centered at $x_{\text{best}}^{(t)}$ and having reduced lengths. Specifically, let $x_1^{(t+1)} = x_1^{(t)}$ and

$$x_j^{(t+1)} = x_1^{(t)} + \beta_j e_j \tag{2.60}$$

for $j = 2, \ldots, p + 1$ where $e_j$ are the $p$ unit vectors along the coordinate axes and $\beta_j$ orients and scales the coordinate-wise steps according to


p, [70 Sign DIS}? itsin DS}? )} #0 (2.61) 
j 0 otherwise. . 


In (2.61), $D(S^{(t)})_{j-1}$ for $j = 2, \ldots, p + 1$ represents the corresponding component of the gradient of $S^{(t)}$ and the scalar factor $d^{(t)}$ is the minimum oriented length

$$d^{(t)} = \min_{2 \le j \le p+1} \left\| x_1^{(t)} - x_j^{(t)} \right\|. \tag{2.62}$$


The rationale for an oriented restart is that the new simplex gradient at $x_{\text{best}}$ should
point in a direction that approximates the true objective function gradient once the 
simplex is small enough, provided that the simplex gradient is in the correct orthant. 
In a case like Example 2.9, note further that the oriented restart strategy would halt 
the simplex collapse. 

The concept of sufficient descent can be generalized for use with a variety of 
Nelder—Mead variants. Instead of adopting a new vertex generated by the standard 
algorithm, one may require the new vertex to meet an additional, more stringent 
condition requiring some type of minimum improvement in the simplex [85, 479, 
518, 636]. 

Despite rare failures and relatively slow convergence speed, the Nelder-Mead algorithm is a very good candidate for many optimization problems. Another attractive feature is that it can be implemented with great numerical efficiency, making it a feasible choice for problems where objective function evaluations are computationally expensive. In virtually every case, the objective function is evaluated at only two new points: $x_r$ and one of $x_e$, $x_o$, and $x_i$. Shrinking occurs only rarely, and in this case $p$ evaluations of the objective function are required.

Finally, for statistical applications like maximum likelihood estimation it is important to find a variance estimate for $\hat{\theta}$. For this purpose, numerical approximation of the Hessian can be completed after convergence is achieved [482, 493].
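A central-difference approximation is one simple way to obtain such a Hessian after a derivative-free search. The sketch below is illustrative only: the function name `numerical_hessian`, the step size `h`, and the toy normal log likelihood with known unit variance are assumptions, and the variance estimate uses the usual inverse of the negative Hessian of the log likelihood.

```python
def numerical_hessian(g, x, h=1e-4):
    """Approximate the Hessian of g at x by central differences."""
    p = len(x)
    H = [[0.0] * p for _ in range(p)]

    def shifted(i, di, j, dj):
        # Evaluate g after shifting coordinate i by di and coordinate j by dj.
        y = list(x)
        y[i] += di
        y[j] += dj
        return g(y)

    for i in range(p):
        for j in range(i, p):
            H[i][j] = H[j][i] = (
                shifted(i, h, j, h) - shifted(i, h, j, -h)
                - shifted(i, -h, j, h) + shifted(i, -h, j, -h)
            ) / (4.0 * h * h)
    return H

# Toy example: log likelihood (up to a constant) for the mean of N(theta, 1) data.
data = [1.2, 0.7, 2.1, 1.6]

def loglik(theta):
    return -0.5 * sum((y - theta[0]) ** 2 for y in data)

theta_hat = [sum(data) / len(data)]          # the MLE is the sample mean
H = numerical_hessian(loglik, theta_hat)
var_hat = -1.0 / H[0][0]                     # inverse of the negative Hessian
```

For this toy model the exact Hessian is $-n = -4$, so the variance estimate should be close to $1/n = 0.25$. With more than one parameter the full matrix $-H$ would be inverted instead.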


2.2.5 Nonlinear Gauss-Seidel Iteration 


An important technique that is used frequently for fitting nonlinear statistical models, 
including those in Chapter 12, is nonlinear Gauss—Seidel iteration. This technique is 
alternatively referred to as backfitting or cyclic coordinate ascent. 

The equation $g'(x) = 0$ is a system of $p$ nonlinear equations in $p$ unknowns. For $j = 1, \ldots, p$, Gauss-Seidel iteration proceeds by viewing the $j$th component of $g'$ as a univariate real function of $x_j$ only. Any convenient univariate optimization method can be used to solve for the one-dimensional root of $g'_j(x_j) = 0$. All $p$ components are cycled through in succession, and at each stage of the cycle the most recent values obtained for each coordinate are used. At the end of the cycle, the complete set of most recent values constitutes $x^{(t+1)}$.
FIGURE 2.14 Application of Gauss-Seidel iteration for maximizing a complex bivariate function, as discussed in Example 2.10. The surface of the function is indicated by shading and contours. From the starting point of $x^{(0)}$, several steps are required to approach the true maximum, $x^*$. Each line segment represents a change of a single coordinate in the current solution, so complete steps from $x^{(t)}$ to $x^{(t+1)}$ correspond to pairs of adjacent segments.


The beauty of this approach lies in the way it simplifies a potentially difficult 
problem. The solution of univariate root-finding problems created by applying Gauss— 
Seidel iteration is generally easy to automate, since univariate algorithms tend to be 
more stable and successful than multivariate ones. Further, the univariate tasks are 
likely to be completed so quickly that the total number of computations may be less 
than would have been required for the multivariate approach. The elegance of this 
strategy means that it is quite easy to program. 
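A minimal sketch of the cycling scheme follows. It swaps in golden-section search as the univariate method, maximizing along each coordinate directly rather than solving $g'_j = 0$, which assumes each univariate slice is unimodal on the bracket searched. The function names, the bracket half-width `width`, the fixed number of `cycles`, and the test objective are all hypothetical.

```python
import math

def golden_max(h, lo, hi, tol=1e-10):
    """Maximize a unimodal univariate function h on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if h(c) > h(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return (a + b) / 2.0

def gauss_seidel_max(g, x0, width=5.0, cycles=50):
    """Cyclic coordinate ascent: optimize one coordinate at a time."""
    x = list(x0)
    for _ in range(cycles):
        for j in range(len(x)):
            def h(u, j=j):
                y = list(x)      # most recent values of all other coordinates
                y[j] = u
                return g(y)
            x[j] = golden_max(h, x[j] - width, x[j] + width)
    return x

def g_demo(x):
    # Hypothetical concave objective with its maximum at (2, -1).
    return -(x[0] ** 2 + x[0] * x[1] + x[1] ** 2) + 3.0 * x[0]

x_gs = gauss_seidel_max(g_demo, [0.0, 0.0])
```

Each inner update uses the newest value of every other coordinate, which is what distinguishes this scheme from updating all coordinates simultaneously.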


Example 2.10 (Bivariate Optimization Problem, Continued) Figure 2.14 illustrates an application of Gauss-Seidel iteration for finding the maximum of the bivariate function discussed in Example 2.4. Unlike other graphs in this chapter, each line segment represents a change of a single coordinate in the current solution. Thus, for example, $x^{(1)}$ is at the vertex following one horizontal step and one vertical step from $x^{(0)}$. Each complete step comprises two univariate steps. A quasi-Newton method was employed for each univariate optimization. Note that the very first univariate optimization (one horizontal step left from $x^{(0)}$) actually failed, finding a local univariate minimum instead of the global univariate maximum. Although this is not advised, subsequent Gauss-Seidel iterations were able to overcome this mistake, eventually finding the global multivariate maximum.


The optimization of continuous multivariate functions is an area of extensive research, and the references given elsewhere in this chapter include a variety of approaches not mentioned here. For example, the trust region approach constrains directions and lengths of steps. The nonlinear conjugate gradient approach chooses search directions that deviate from the direction of the gradient with a bias toward directions not previously explored.


PROBLEMS 


2.1. The following data are an i.i.d. sample from a Cauchy($\theta$, 1) distribution: 1.77, -0.23, 2.76, 3.80, 3.47, 56.75, -1.34, 4.24, -2.44, 3.29, 3.71, -2.40, 4.53, -0.07, -1.05, -13.87, -2.53, -1.75, 0.27, 43.21.

a. Graph the log likelihood function. Find the MLE for $\theta$ using the Newton-Raphson method. Try all of the following starting points: -11, -1, 0, 1.5, 4, 4.7, 7, 8, and 38. Discuss your results. Is the mean of the data a good starting point?

b. Apply the bisection method with starting points -1 and 1. Use additional runs to illustrate manners in which the bisection method may fail to find the global maximum.

c. Apply fixed-point iterations as in (2.29), starting from -1, with scaling choices of $\alpha = 1$, 0.64, and 0.25. Investigate other choices of starting values and scaling factors.

d. From starting values of $(\theta^{(0)}, \theta^{(1)}) = (-2, -1)$, apply the secant method to estimate $\theta$. What happens when $(\theta^{(0)}, \theta^{(1)}) = (-3, 3)$, and for other starting choices?

e. Use this example to compare the speed and stability of the Newton-Raphson method, bisection, fixed-point iteration, and the secant method. Do your conclusions change when you apply the methods to a random sample of size 20 from a N($\theta$, 1) distribution?

2.2. Consider the density $f(x) = [1 - \cos\{x - \theta\}]/2\pi$ on $0 \le x \le 2\pi$, where $\theta$ is a parameter between $-\pi$ and $\pi$. The following i.i.d. data arise from this density: 3.91, 4.85, 2.28, 4.06, 3.70, 4.04, 5.46, 3.53, 2.28, 1.96, 2.53, 3.88, 2.22, 3.47, 4.82, 2.46, 2.99, 2.54, 0.52, 2.50. We wish to estimate $\theta$.

a. Graph the log likelihood function between $-\pi$ and $\pi$.

b. Find the method-of-moments estimator of $\theta$.

c. Find the MLE for $\theta$ using the Newton-Raphson method, using the result from (b) as the starting value. What solutions do you find when you start at -2.7 and 2.7?

d. Repeat part (c) using 200 equally spaced starting values between $-\pi$ and $\pi$. Partition the interval between $-\pi$ and $\pi$ into sets of attraction. In other words, divide the set of starting values into separate groups, with each group corresponding to a separate unique outcome of the optimization (a local mode). Discuss your results.

e. Find two starting values, as nearly equal as you can, for which the Newton-Raphson method converges to two different solutions.

2.3. Let the survival time $t$ for individuals in a population have density function $f$ and cumulative distribution function $F$. The survivor function is then $S(t) = 1 - F(t)$. The hazard function is $h(t) = f(t)/(1 - F(t))$, which measures the instantaneous risk of dying at time $t$ given survival to time $t$. A proportional hazards model posits that the hazard function depends on both time and a vector of covariates, $x$, through the model

$$h(t \mid x) = \lambda(t) \exp\{x^T \beta\},$$

where $\beta$ is a parameter vector.

If $\Lambda(t) = \int_0^t \lambda(u)\, du$, it is easy to show that $S(t) = \exp\{-\Lambda(t) \exp\{x^T \beta\}\}$ and $f(t) = \lambda(t) \exp\{x^T \beta - \Lambda(t) \exp\{x^T \beta\}\}$.

TABLE 2.2 Length of remission (in weeks) for acute leukemia patients in the treatment and control groups of a clinical trial, with parentheses indicating censored values. For censored cases, patients are known to be in remission at least as long as the indicated value.

Treatment    (6)    6     6     6     7    (9)  (10)
             10   (11)   13    16   (17)  (19)  (20)
             22    23   (25)  (32)  (32)  (34)  (35)
Control       1     1     2     2     3     4     4
              5     5     8     8     8     8    11
             11    12    12    15    17    22    23

a. Suppose that our data are censored survival times $t_i$ for $i = 1, \ldots, n$. At the end of the study a patient is either dead (known survival time) or still alive (censored time; known to survive at least to the end of the study). Define $w_i$ to be 1 if $t_i$ is an uncensored time and 0 if $t_i$ is a censored time. Prove that the log likelihood takes the form

$$\sum_{i=1}^{n} \left( w_i \log\{\mu_i\} - \mu_i \right) + \sum_{i=1}^{n} w_i \log\left\{ \frac{\lambda(t_i)}{\Lambda(t_i)} \right\},$$

where $\mu_i = \Lambda(t_i) \exp\{x_i^T \beta\}$.

b. Consider a model for the length of remission for acute leukemia patients in a clinical trial. Patients were either treated with 6-mercaptopurine (6-MP) or a placebo [202]. One year after the start of the study, the length (weeks) of the remission period for each patient was recorded (see Table 2.2). Some outcomes were censored because remission extended beyond the study period. The goal is to determine whether the treatment lengthened time spent in remission. Suppose we set $\Lambda(t) = t^{\alpha}$ for $\alpha > 0$, yielding a hazard function proportional to $\alpha t^{\alpha - 1}$ and a Weibull density: $f(t) = \alpha t^{\alpha - 1} \exp\{x^T \beta - t^{\alpha} \exp\{x^T \beta\}\}$. Adopt the covariate parameterization given by $x_i^T \beta = \beta_0 + \delta_i \beta_1$ where $\delta_i$ is 1 if the $i$th patient was in the treatment group and 0 otherwise. Code a Newton-Raphson algorithm and find the MLEs of $\alpha$, $\beta_0$, and $\beta_1$.

c. Use any prepackaged Newton-Raphson or quasi-Newton routine to solve for the same MLEs.

d. Estimate standard errors for your MLEs. Are any of your MLEs highly correlated? Report the pairwise correlations.

e. Use nonlinear Gauss-Seidel iteration to find the MLEs. Comment on the implementation ease of this method compared to the multivariate Newton-Raphson method.
f. Use the discrete Newton method to find the MLEs. Comment on the stability of this method.

TABLE 2.3 Counts of flour beetles in all stages of development over 154 days.

Days       0    8    28    41    63    79    97    117    135    154
Beetles    2    47   192   256   768   896   1120  896    1184   1024


2.4. A parameter $\theta$ has a Gamma(2, 1) posterior distribution. Find the 95% highest posterior density interval for $\theta$, that is, the interval containing 95% of the posterior probability for which the posterior density for every point contained in the interval is never lower than the density for every point outside the interval. Since the gamma density is unimodal, the interval is also the narrowest possible interval containing 95% of the posterior probability.


2.5. There were 46 crude oil spills of at least 1000 barrels from tankers in U.S. waters during 1974-1999. The website for this book contains the following data: the number of spills in the $i$th year, $N_i$; the estimated amount of oil shipped through U.S. waters as part of U.S. import/export operations in the $i$th year, adjusted for spillage in international or foreign waters, $b_{i1}$; and the amount of oil shipped through U.S. waters during domestic shipments in the $i$th year, $b_{i2}$. The data are adapted from [11]. Oil shipment amounts are measured in billions of barrels (Bbbl).

The volume of oil shipped is a measure of exposure to spill risk. Suppose we use the Poisson process assumption given by $N_i \mid b_{i1}, b_{i2} \sim \text{Poisson}(\lambda_i)$ where $\lambda_i = \alpha_1 b_{i1} + \alpha_2 b_{i2}$. The parameters of this model are $\alpha_1$ and $\alpha_2$, which represent the rate of spill occurrence per Bbbl oil shipped during import/export and domestic shipments, respectively.

a. Derive the Newton-Raphson update for finding the MLEs of $\alpha_1$ and $\alpha_2$.

b. Derive the Fisher scoring update for finding the MLEs of $\alpha_1$ and $\alpha_2$.

c. Implement the Newton-Raphson and Fisher scoring methods for this problem, provide the MLEs, and compare the implementation ease and performance of the two methods.

d. Estimate standard errors for the MLEs of $\alpha_1$ and $\alpha_2$.

e. Apply the method of steepest ascent. Use step-halving backtracking as necessary.

f. Apply quasi-Newton optimization with the Hessian approximation update given in (2.49). Compare performance with and without step halving.

g. Construct a graph resembling Figure 2.8 that compares the paths taken by methods used in (a)-(f). Choose the plotting region and starting point to best illustrate the features of the algorithms' performance.


2.6. Table 2.3 provides counts of a flour beetle (Tribolium confusum) population at various points in time [103]. Beetles in all stages of development were counted, and the food supply was carefully controlled.

An elementary model for population growth is the logistic model given by

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right), \tag{2.63}$$

where $N$ is population size, $t$ is time, $r$ is a growth rate parameter, and $K$ is a parameter that represents the population carrying capacity of the environment. The solution to this differential equation is given by

$$N_t = f(t) = \frac{K N_0}{N_0 + (K - N_0) \exp\{-rt\}}, \tag{2.64}$$

where $N_t$ denotes the population size at time $t$.

a. Fit the logistic growth model to the flour beetle data using the Gauss-Newton approach to minimize the sum of squared errors between model predictions and observed counts.

b. Fit the logistic growth model to the flour beetle data using the Newton-Raphson approach to minimize the sum of squared errors between model predictions and observed counts.

c. In many population modeling applications, an assumption of lognormality is adopted. The simplest assumption would be that the $\log N_t$ are independent and normally distributed with mean $\log f(t)$ and variance $\sigma^2$. Find the MLEs under this assumption, using both the Gauss-Newton and the Newton-Raphson methods. Provide standard errors for your parameter estimates, and an estimate of the correlation between them. Comment.

2.7. Himmelblau's function is $f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$. This function has four minima with a local maximum amid them. Illustrate how performance of the Nelder-Mead algorithm can differ depending on the choices for $\alpha_r$, $\alpha_e$, and $\alpha_c$, for the following tasks.

a. Demonstrate effects for finding a minimum of this function.

b. Demonstrate effects for finding the local maximum of this function. How well would a derivative-based procedure work in this case? Show examples.

Comment on your results.

2.8. Recall that for a two-dimensional problem, the Nelder-Mead algorithm maintains at each iteration a set of three possible solutions defining the vertices of a simplex, specifically a triangle. Let us consider whether three is a good choice. Imagine an algorithm for two-dimensional optimization that maintains four points defining the vertices of a convex quadrilateral and is similar to Nelder-Mead in spirit. Speculate how such a procedure could proceed. Consider sketches like those shown in Figure 2.10. What are some of the inherent challenges? There is no correct answer here; the purpose is to brainstorm and see where your ideas lead.


CHAPTER 3 


COMBINATORIAL OPTIMIZATION 


It is humbling to learn that there are entire classes of optimization problems for which 
most methods—including those described previously—are utterly useless. 

We will pose these problems as maximizations except in Section 3.3, although 
in nonstatistical contexts minimization is often customary. For statistical applications, 
recall that maximizing the log likelihood is equivalent to minimizing the negative log 
likelihood. 

Let us assume that we are seeking the maximum of $f(\theta)$ with respect to
$\theta = (\theta_1, \ldots, \theta_p)$, where $\theta \in \Theta$ and $\Theta$ consists of $N$ elements for a finite positive integer
$N$. In statistical applications, it is not uncommon for a likelihood function to depend
on configuration parameters that describe the form of a statistical model and for
which there are many discrete choices, as well as a small number of other parameters
that could be easily optimized if the best configuration were known. In such cases,
we may view $f(\theta)$ as the log profile likelihood of a configuration, $\theta$, that is, the
highest likelihood attainable using that configuration. Section 3.1.1 provides several
examples.

Each $\theta \in \Theta$ is termed a candidate solution. Let $f_{\max}$ denote the global
maximum value of $f(\theta)$ achievable for $\theta \in \Theta$, and let the set of global maxima be
$\mathcal{M} = \{\theta \in \Theta : f(\theta) = f_{\max}\}$. If there are ties, $\mathcal{M}$ will contain more than one element.
Despite the finiteness of $\Theta$, finding an element of $\mathcal{M}$ may be very hard if there are
distracting local maxima, plateaus, and long paths toward optima in $\Theta$, and if $N$ is
extremely large.


3.1 HARD PROBLEMS AND NP-COMPLETENESS 


Hard optimization problems are generally combinatorial in nature. In such problems, 
p items may be combined or sequenced in a very large number of ways, and each 
choice corresponds to one element in the space of possible solutions. Maximization 
requires a search of this very large space. 

For example, consider the traveling salesman problem. In this problem, the 
salesman must visit each of p cities exactly once and return to his point of origin, 
using the shortest total travel distance. We seek to minimize the total travel distance 
over all possible routes (i.e., maximize the negative distance). If the distance between 
two cities does not depend on the direction traveled between them, then there are
$(p - 1)!/2$ possible routes (since the point of origin and direction of travel are
arbitrary). Note that any tour corresponds to a permutation of the integers $1, \ldots, p$, which
specifies the sequence in which the cities are visited. 
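
To make the route count concrete, the following minimal Python sketch counts the distinct tours and, for very small $p$, finds the shortest one by exhaustive enumeration. The function names and the representation of cities as planar coordinates are assumptions made for illustration; they are not part of the text.

```python
from itertools import permutations
from math import dist, factorial


def count_routes(p):
    """Number of distinct closed tours of p >= 3 cities with symmetric distances."""
    return factorial(p - 1) // 2


def shortest_tour(cities):
    """Exhaustive search over tours; feasible only for very small p.

    `cities` is a list of (x, y) coordinates (an illustrative assumption).
    City 0 is fixed as the point of origin, so (p - 1)! orderings are tried;
    each tour and its reverse are both visited, which is harmless here.
    """
    p = len(cities)
    best_len, best_tour = float("inf"), None
    for rest in permutations(range(1, p)):
        tour = (0,) + rest
        length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % p]])
                     for i in range(p))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour


# Four cities on the corners of a unit square.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
best_len, best_tour = shortest_tour(square)
```

Even at $p = 10$ there are already 181,440 distinct tours, which is why enumeration stops being an option almost immediately.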

To consider the difficulty of such problems, it is useful to discuss the number
of steps required for an algorithm to solve them, where steps are simple operations like
arithmetic, comparisons, and branching. The number of operations depends, of course,
on the size of the problem posed. In general the size of a problem may be specified
as the number of inputs needed to pose it. The traveling salesman problem is posed
by specifying $p$ city locations to be sequenced. The difficulty of a particular size-$p$
problem is characterized by the number of operations required to solve it in the worst
case using the best known algorithm.

The number of operations is only a rough notion, because it varies with
implementation language and strategy. It is conventional, however, to bound the number
of operations using the notation $O(h(p))$. If $h(p)$ is polynomial in $p$, an algorithm is
said to be polynomial.

Although the actual running time on a computer depends on the speed of the 
computer, we generally equate the number of operations and the execution time by 
relying on the simplifying assumption that all basic operations take the same amount 
of time (one unit). Then we may make meaningful comparisons of algorithm speeds 
even though the absolute scale is meaningless. 

Consider two problems of size $p = 20$. Suppose that the first problem can be
solved in polynomial time [say $O(p^2)$ operations], and the solution requires 1 minute
on your office computer. Then the size-21 problem could be solved in just a few
seconds more. The size-25 problem can be solved in about 1.56 minutes, size 30 in 2.25
minutes, and size 50 in 6.25 minutes. Suppose the second problem is $O(p!)$ and
requires 1 minute for size 20. Then it would take 21 minutes for size 21, 12.1 years
(6,375,600 minutes) for size 25, 207 million years for size 30, and $2.4 \times 10^{40}$ years for
size 50. Similarly, if an $O(p!)$ traveling salesman problem of size 20 could be solved
in 1 minute, it would require far longer than the lifetime of the universe to determine
the optimal path for the traveling salesman to make a tour of the 50 U.S. state capitals. 
Furthermore, obtaining a computer that is 1000 times faster would barely reduce the 
difficulty. The conclusion is stark: Some optimization problems are simply too hard. 
The complexity of a polynomial problem—even for large p and high polynomial 
order—is dwarfed by the complexity of a quite small nonpolynomial problem. 
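
The arithmetic behind these comparisons can be reproduced with a short Python sketch. It assumes, as the quoted timings of 2.25 and 6.25 minutes imply, that the polynomial problem is $O(p^2)$, and that running time is exactly proportional to the operation count; the function names are illustrative.

```python
from math import factorial

MINUTES_PER_YEAR = 60 * 24 * 365


def quadratic(p):
    """Operation count for the assumed O(p^2) problem."""
    return p ** 2


def scaled_minutes(ops, p, base_p=20, base_minutes=1.0):
    """Running time at size p if size base_p takes base_minutes,
    assuming time is proportional to the operation count ops(p)."""
    return base_minutes * ops(p) / ops(base_p)


def to_years(minutes):
    return minutes / MINUTES_PER_YEAR


for p in (21, 25, 30, 50):
    poly_min = scaled_minutes(quadratic, p)
    fact_years = to_years(scaled_minutes(factorial, p))
    print(f"p={p}: O(p^2) {poly_min:.4f} min, O(p!) {fact_years:.3g} years")
```

A machine 1000 times faster divides every factorial figure by 1000, which still leaves roughly $10^{37}$ years at size 50.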

The theory of problem complexity is discussed in [214, 497]. For us to dis- 
cuss this issue further, we must make a formal distinction between optimization (i.e., 
search) problems and decision (i.e., recognition) problems. Thus far, we have
considered optimization problems of the form: “Find the value of $\theta \in \Theta$ that maximizes
$f(\theta)$.” The decision counterpart to this is: “Is there a $\theta \in \Theta$ for which $f(\theta) > c$, for
a fixed number $c$?” Clearly there is a close relationship between these two versions
of the problem. In principle, we could solve the optimization problem by repeatedly 
solving the decision problem for strategically chosen values of c. 
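
One way to choose the values of $c$ strategically is bisection. The sketch below is only an illustration under strong assumptions: $f$ is integer valued, bounds `lo` and `hi` on its maximum are known, and `decide(c)` is a hypothetical oracle that answers the decision problem "is there a $\theta$ with $f(\theta) > c$?". It recovers the maximum value, not the maximizing $\theta$.

```python
def max_value_via_decision(decide, lo, hi):
    """Return max f, given lo <= max f <= hi and an oracle decide(c) that is
    True exactly when some candidate has f > c. Uses O(log(hi - lo)) calls."""
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(mid):      # some f exceeds mid, so the maximum is at least mid + 1
            lo = mid + 1
        else:                # no f exceeds mid, so the maximum is at most mid
            hi = mid
    return lo


# Toy oracle: brute force over a tiny candidate set (illustration only).
values = [3, 7, 5]
calls = []


def toy_decide(c):
    calls.append(c)
    return any(v > c for v in values)


f_max = max_value_via_decision(toy_decide, lo=0, hi=10)
```

The point of the construction is that the number of oracle calls grows only logarithmically in the range of $f$, so a fast decision procedure would yield a fast optimizer.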

Decision problems that can be solved in polynomial time [e.g., $O(p^k)$ operations
for p inputs and constant k] are generally considered to be efficiently solvable [214]. 
These problems belong to the class denoted P. Once any polynomial-time algorithm 
has been identified for a problem, the order of the polynomial is often quickly reduced
to practical levels [497]. Decision problems for which a given solution can be checked
in polynomial time are called NP problems. Clearly a problem in P is in NP. However, 
there seem to be many decision problems, like the traveling salesman problem, that 
are much easier to check than they are to solve. In fact, there are many NP problems 
for which no polynomial-time solution has ever been developed. Many NP problems 
have been proven to belong to a special class for which a polynomial algorithm found 
to solve one such problem could be used to solve all such problems. This is the class 
of NP-complete problems. There are other problems at least as difficult, for which a 
polynomial algorithm—if found—would be known to provide a solution to all NP- 
complete problems, even though the problem itself is not proven to be NP-complete. 
These are NP-hard problems. There are also many combinatorial decision problems 
that are difficult and probably NP-complete or NP-hard although they haven’t been 
proven to be in these classes. Finally, optimization problems are no easier than their 
decision counterparts, and we may classify optimization problems using the same 
categories listed above. 
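
The gap between checking and solving can be illustrated with the decision version of the traveling salesman problem posed as a minimization: "Is there a tour of total length at most $c$?" The Python sketch below verifies a proposed tour in time polynomial in $p$; the function name and the use of planar coordinates are assumptions for illustration.

```python
from math import dist


def check_tour(cities, tour, c):
    """Return True if `tour` visits each city exactly once and the closed
    tour has length at most c. Cost is O(p log p) for the permutation check
    plus O(p) distance evaluations."""
    p = len(cities)
    if sorted(tour) != list(range(p)):
        return False          # not a permutation of the city labels
    length = sum(dist(cities[tour[i]], cities[tour[(i + 1) % p]])
                 for i in range(p))
    return length <= c


square = [(0, 0), (0, 1), (1, 1), (1, 0)]
ok = check_tour(square, [0, 1, 2, 3], 4.0)        # perimeter of the square
too_long = check_tour(square, [0, 2, 1, 3], 4.0)  # uses both diagonals
```

Verifying a certificate is cheap; the difficulty lies in finding a tour that passes the check when nobody hands one over.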

It has been shown that if there is a polynomial algorithm for any NP-complete 
problem, then there are polynomial algorithms for all NP-complete problems. The 
utter failure of scientists to develop a polynomial algorithm for any NP-complete 
problem motivates the popular conjecture that there cannot be any polynomial algo- 
rithm for any NP-complete problem. Proof (or counterexample) of this conjecture is 
one of the great unsolved problems in mathematics. 

This leads us to the realization that there are optimization problems that are 
inherently too difficult to solve exactly by traditional means. Many problems in bioin- 
formatics, experimental design, and nonparametric statistical modeling, for example, 
require combinatorial optimization. 


3.1.1 Examples 


Statisticians have been slow to realize how frequently combinatorial optimization 
problems are encountered in mainstream statistical model-fitting efforts. Below we 
give two examples. In general, when fitting a model requires optimal decisions about 
the inclusion, exclusion, or arrangement of a number of parameters in a set of possible 
parameters, combinatorial optimization problems arise frequently. 


Example 3.1 (Genetic Mapping) Genetic data for individuals and groups of related 
individuals are often analyzed in ways that present highly complex combinatorial 
optimization problems. For example, consider the problem of locating genes on a 
chromosome, known as the genetic mapping problem. 

The genes, or more generally genetic markers, of interest in a chromosome 
can be represented as a sequence of symbols. The position of each symbol along the 
chromosome is called its locus. The symbols indicate genes or genetic markers, and 
the particular content stored at a locus is an allele. 

Diploid species like humans have pairs of chromosomes and hence two alleles 
at any locus. An individual is homozygous at a locus if the two alleles are identical 
at this locus; otherwise the individual is heterozygous. In either case, each parent 
contributes one allele at each locus of an offspring’s chromosome pair. There are
two possible contributions from any parent, because the parent has two alleles at the
corresponding locus in his/her chromosome pair. Although each parent allele has a
50% chance of being contributed to the offspring, the contributions from a particular
parent are not made independently at random. Instead, the contribution by a parent
consists of a chromosome built during meiosis from segments of each chromosome in
the parent’s pair of chromosomes. These segments will contain several loci. When the
source of the alleles on the contributed chromosome changes from one chromosome
of the parent’s pair to the other one, a crossover is said to have occurred. Figure 3.1
illustrates a crossover occurring during meiosis, forming the chromosome contributed
to the offspring by one parent. This method of contribution means that alleles whose
loci are closer together on one of the parent’s chromosomes are more likely to appear
together on the chromosome contributed by that parent.


[Figure 3.1: the parent’s chromosome pair, with alleles marked 0 on one chromosome and 1 on the other, and the parent’s contribution to the offspring after meiosis, 0 0 0 1 1.]

FIGURE 3.1 During meiosis, a crossover occurs between the third and fourth loci. The zeros
and ones indicate the origin of each allele in the contributed chromosome. Only one parental
contribution is shown, for simplicity.

When the alleles at two loci of a parent’s chromosome appear jointly on the 
contributed chromosome more frequently than would be expected by chance alone, 
they are said to be linked. When the alleles at two different loci of a parent’s chro- 
mosome do not both appear in the contributed chromosome, a recombination has 
occurred between the loci. The frequency of recombinations determines the degree 
of linkage between two loci: Infrequent recombination corresponds to strong linkage. 
The degree of linkage, or map distance, between two loci corresponds to the expected 
number of crossovers between the two loci. 

A genetic map of $p$ markers consists of an ordering of their loci and a list of
distances or probabilities of recombination between adjacent loci. Assign to each
locus a label, $\ell$, for $\ell = 1, \ldots, p$. The ordering component of the map, denoted
$\theta = (\theta_1, \ldots, \theta_p)$, describes the arrangement of the $p$ locus labels in order of their
positions along the chromosome, with $\theta_j = \ell$ if the locus labeled $\ell$ lies at the $j$th
position along the chromosome. Thus, $\theta$ is a permutation of the integers $1, \ldots, p$. The
other component of a genetic map is a list of distances between adjacent loci. Denote
the probability of recombination between adjacent loci $\theta_j$ and $\theta_{j+1}$ as $d(\theta_j, \theta_{j+1})$.
This amounts to the map distance between these loci. Figure 3.2 illustrates this
notation.

Position, $j$                            1         2         3         4
Locus label, $\theta_j$                  3         1         4         2
Distance, $d(\theta_j, \theta_{j+1})$       d(3,1)    d(1,4)    d(4,2)

FIGURE 3.2 Notation for gene mapping example with $p = 4$ loci. The loci are labeled in
boxes at their positions along the chromosome. The correct sequential ordering of loci is defined
by the $\theta_j$ values. Distances between loci are given by $d(\theta_j, \theta_{j+1})$ for $j = 1, \ldots, 3$.


Such a map can be estimated by observing the alleles at the $p$ loci for a sample
of $n$ chromosomes generated during the meiosis from a parent that is heterozygous
at all $p$ loci. Each such chromosome can be represented by a sequence of zeros and
ones, indicating the origin of each allele in the contributing parent. For example, the
chromosome depicted on the right side of Figure 3.1 can be denoted 00011, because
the first three alleles originate from the first chromosome of the parent and the final
two alleles originate from the second chromosome of the parent.

Let the random variable $X_{i,\theta_j}$ denote the origin of the allele in the locus labeled $\theta_j$
for the $i$th chromosome generated during meiosis. The dataset consists of observations,
$x_{i,\theta_j}$, of these random variables. Thus, a recombination for two adjacent markers
has been observed in the $i$th case if $|x_{i,\theta_j} - x_{i,\theta_{j+1}}| = 1$, and no recombination has
been observed if $|x_{i,\theta_j} - x_{i,\theta_{j+1}}| = 0$. If recombination events are assumed to occur
independently in each interval, the probability of a given map is

$$
\prod_{j=1}^{p-1} \prod_{i=1}^{n} \Bigl\{ \bigl(1 - d(\theta_j, \theta_{j+1})\bigr) \bigl(1 - |x_{i,\theta_j} - x_{i,\theta_{j+1}}|\bigr) + d(\theta_j, \theta_{j+1}) \, |x_{i,\theta_j} - x_{i,\theta_{j+1}}| \Bigr\}. \tag{3.1}
$$


Given an ordering $\theta$, the MLEs for the recombination probabilities are easily found
to be

$$
\hat{d}(\theta_j, \theta_{j+1}) = \frac{1}{n} \sum_{i=1}^{n} |x_{i,\theta_j} - x_{i,\theta_{j+1}}|. \tag{3.2}
$$


Given $d(\theta_j, \theta_{j+1})$, the number of recombinations between the loci in positions $j$ and
$j + 1$ is $\sum_{i=1}^{n} |X_{i,\theta_j} - X_{i,\theta_{j+1}}|$, which has a $\mathrm{Bin}(n, d(\theta_j, \theta_{j+1}))$ distribution. We can
compute the profile likelihood for $\theta$ by adding the log likelihoods of the $p - 1$ sets of
adjacent loci and replacing each $d(\theta_j, \theta_{j+1})$ by its conditional maximum likelihood
estimate $\hat{d}(\theta_j, \theta_{j+1})$. Let $\hat{d}(\theta)$ compute these maximum likelihood estimates for any
$\theta$. Then the profile likelihood for $\theta$ is

$$
\begin{aligned}
l(\theta \mid \hat{d}(\theta)) &= \sum_{j=1}^{p-1} n \Bigl\{ \hat{d}(\theta_j, \theta_{j+1}) \log\{\hat{d}(\theta_j, \theta_{j+1})\} + \bigl(1 - \hat{d}(\theta_j, \theta_{j+1})\bigr) \log\{1 - \hat{d}(\theta_j, \theta_{j+1})\} \Bigr\} \\
&= \sum_{j=1}^{p-1} T(\theta_j, \theta_{j+1}),
\end{aligned} \tag{3.3}
$$

where $T(\theta_j, \theta_{j+1})$ is defined to be zero if $\hat{d}(\theta_j, \theta_{j+1})$ is zero or one. Then the maximum
likelihood genetic map is obtained by maximizing (3.3) over all permutations $\theta$. Note
that (3.3) constitutes a sum of terms $T(\theta_j, \theta_{j+1})$ whose values depend on only two
loci. Suppose that all possible pairs of loci are enumerated, and the value $T(i, j)$ is
computed for every $i$ and $j$ where $1 \le i < j \le p$. There are $p(p - 1)/2$ such values of
$T(i, j)$. The profile log likelihood can then be computed rapidly for any permutation
$\theta$ by summing the necessary values of $T(i, j)$.
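
A minimal Python sketch of this precomputation follows. It assumes the data are stored as a list of $n$ chromosomes, each a list of 0/1 origins indexed by locus label, and it labels loci from 0 rather than 1; the function names and the toy data are illustrative only.

```python
from math import log


def pairwise_T(x):
    """Tabulate T(a, b) for every pair of locus labels a < b.

    x[i][a] is the 0/1 origin of the allele at locus a on chromosome i.
    d_hat is the observed recombination fraction for the pair, as in (3.2).
    """
    n, p = len(x), len(x[0])
    T = {}
    for a in range(p):
        for b in range(a + 1, p):
            d_hat = sum(abs(row[a] - row[b]) for row in x) / n
            if d_hat in (0.0, 1.0):
                T[a, b] = 0.0     # defined to be zero at the boundary
            else:
                T[a, b] = n * (d_hat * log(d_hat)
                               + (1 - d_hat) * log(1 - d_hat))
    return T


def profile_loglik(theta, T):
    """Profile log likelihood (3.3) of an ordering: sum T over adjacent loci."""
    return sum(T[min(a, b), max(a, b)] for a, b in zip(theta, theta[1:]))


# Toy data: n = 4 chromosomes observed at p = 3 loci.
x = [[0, 0, 1],
     [0, 0, 0],
     [0, 1, 1],
     [1, 1, 1]]
T = pairwise_T(x)
ll_012 = profile_loglik((0, 1, 2), T)
ll_021 = profile_loglik((0, 2, 1), T)
```

Once the table is built, evaluating any ordering costs only $p - 1$ lookups, so the expense of the search lies entirely in the number of orderings, not in the likelihood itself.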

However, finding the maximum likelihood genetic map requires maximizing 
the profile likelihood by searching over all p!/2 possible permutations. This is a 
variant of the traveling salesman problem, where each genetic marker corresponds to 
a city and the distance between cities $i$ and $j$ is $T(i, j)$. The salesman’s tour may start
at any city and terminates in the last city visited. A tour and its reverse are equivalent. 
There are no known algorithms for solving general traveling salesman problems in 
polynomial time. 

Further details and extensions of this example are considered in [215, 572]. 


Example 3.2 (Variable Selection in Regression) Consider a multiple linear
regression problem with $p$ potential predictor variables. A fundamental step in
regression is selection of a suitable model. Given a dependent variable $Y$ and a set
of candidate predictors $x_1, x_2, \ldots, x_p$, we must find the best model of the form
$Y = \beta_0 + \sum_{j=1}^{s} \beta_{i_j} x_{i_j} + \epsilon$, where $\{i_1, \ldots, i_s\}$ is a subset of $\{1, \ldots, p\}$ and $\epsilon$ denotes
a random error. The notion of what model is best may have any of several meanings.

Suppose that the goal is to use the Akaike information criterion (AIC) to select 
the best model [7, 86]. We seek to find the subset of predictors that minimizes the 
fitted model AIC, 


AIC = N log{RSS/N} + 2(s + 2), (3.4) 


where $N$ is the sample size, $s$ is the number of predictors in the model, and RSS
is the sum of squared residuals. Alternatively, suppose that Bayesian regression is
performed, say with the normal-gamma conjugate class of priors $\beta \sim N(\mu, \sigma^2 V)$ and
$\nu\lambda/\sigma^2 \sim \chi^2_\nu$. In this case, one might seek to find the subset of predictors corresponding
to the model that maximizes the posterior model probability [527].

In either case, the variable selection problem requires an optimization over a
space of $2^{p+1}$ possible models, since each variable and the intercept may be included
or omitted. It also requires estimating the best $\beta_{i_j}$ for each of the $2^{p+1}$ possible
models, but this step is easy for any given model. Although a search algorithm that is
more efficient than exhaustive search has been developed to optimize some classical 
regression model selection criteria, it is practical only for fairly small p [213, 465]. 
We know of no efficient general algorithm to find the global optimum (i.e., the single 
best model) for either the AIC or the Bayesian goals. 
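
For very small $p$ the AIC search can simply be done exhaustively, which makes the structure of the problem visible. The Python sketch below evaluates (3.4) for every subset of predictors. It is an illustration under stated assumptions: the intercept is always included (so only $2^p$ models are visited), least squares is solved through the normal equations with plain Gauss-Jordan elimination, and all function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import log


def rss(cols, y):
    """Residual sum of squares from least squares of y on the given columns."""
    k, n = len(cols), len(y)
    # Augmented normal equations [X'X | X'y].
    A = [[sum(cols[r][i] * cols[s][i] for i in range(n)) for s in range(k)]
         + [sum(cols[r][i] * y[i] for i in range(n))] for r in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                m = A[r][c] / A[c][c]
                A[r] = [a - m * b for a, b in zip(A[r], A[c])]
    beta = [A[r][k] / A[r][r] for r in range(k)]
    return sum((y[i] - sum(beta[r] * cols[r][i] for r in range(k))) ** 2
               for i in range(n))


def aic(subset, x, y):
    """AIC as in (3.4); x[j] is the data column for predictor j."""
    n = len(y)
    cols = [[1.0] * n] + [x[j] for j in subset]   # intercept always included
    return n * log(rss(cols, y) / n) + 2 * (len(subset) + 2)


def best_subset(x, y):
    """Exhaustive search over all 2^p subsets; feasible only for small p."""
    p = len(x)
    subsets = (s for size in range(p + 1)
               for s in combinations(range(p), size))
    return min(subsets, key=lambda s: aic(s, x, y))


# Toy data: y depends strongly on predictor 0 and not on predictor 1.
x = [[1, 2, 3, 4, 5, 6, 7, 8], [1, -1, 1, -1, 1, -1, 1, -1]]
noise = [0.1, -0.2, 0.15, -0.05, -0.1, 0.2, -0.15, 0.05]
y = [2 * a + e for a, e in zip(x[0], noise)]
best = best_subset(x, y)
```

Each individual fit is cheap; the cost is driven by the number of subsets, which doubles with every additional candidate predictor.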


3.1.2 Need for Heuristics 


The existence of such challenging problems requires a new perspective on
optimization. It is necessary to abandon algorithms that are guaranteed to find the global
maximum (under suitable conditions) but will never succeed within a practical time
limit. Instead we turn to algorithms that can find a good local maximum within toler- 
able time. 

Such algorithms are sometimes called heuristics. They are intended to find a 
globally competitive candidate solution (i.e., a nearly optimal one), with an explicit 
trade of global optimality for speed. The two primary features of such heuristics are 


1. iterative improvement of a current candidate solution, and 


2. limitation of the search to a local neighborhood at any particular iteration. 


These two characteristics embody the heuristic strategy of local search, which we 
address first. 

No single heuristic will work well in all problems. In fact, there is no search 
algorithm whose performance is better than another when performance is averaged 
over the set of all possible discrete functions [576, 672]. There is clearly a motivation 
to adopt different heuristics for different problems. Thus we continue beyond local 
search to examine simulated annealing, genetic algorithms, and tabu algorithms. 


3.2 LOCAL SEARCH 


Local search is a very broad optimization paradigm that arguably encompasses all
of the techniques described in this chapter. In this section, we introduce some of
its simplest, most generic variations such as k-optimization and random starts local
search.

Basic local search is an iterative procedure that updates a current candidate
solution $\theta^{(t)}$ at iteration $t$ to $\theta^{(t+1)}$. The update is termed a move or a step. One or more
possible moves are identified from a neighborhood of $\theta^{(t)}$, say $\mathcal{N}(\theta^{(t)})$. The advantage
of local search over global (i.e., exhaustive) search is that only a tiny portion of $\Theta$
need be searched at any iteration, and large portions of $\Theta$ may never be examined.
The disadvantage is that the search is likely to terminate at an uncompetitive local
maximum.

A neighborhood of the current candidate solution, $\mathcal{N}(\theta^{(t)})$, contains candidate
solutions that are near $\theta^{(t)}$. Often, proximity is enforced by limiting the number of
changes to the current candidate solution used to generate an alternative. In practice, 
simple changes to the current candidate solution are usually best, resulting in small 
neighborhoods that are easily searched or sampled. Complex alterations are often 
difficult to conceptualize, complicated to code, and slow to execute. Moreover, they 
rarely improve search performance, despite the intuition that larger neighborhoods 
would be less likely to lead to entrapment at a poor local maximum. If the neighbor- 
hood is defined by allowing as many as k changes to the current candidate solution 
in order to produce the next candidate, then it is a k-neighborhood, and the alteration 
of those features is called a k-change. 

The definition of a neighborhood is intentionally vague to allow flexible usage
of the term in a wide variety of problems. For the gene mapping problem
introduced in Example 3.1, suppose $\theta^{(t)}$ is a current ordering of genetic markers. A simple
neighborhood might be the set of all orderings that can be obtained by swapping the
locations of only two markers on the chromosome whose order is $\theta^{(t)}$. In the
regression model selection problem introduced in Example 3.2, a simple neighborhood is
the set of models that either add or omit one predictor from $\theta^{(t)}$.
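
These two neighborhoods can be written down directly. The Python sketch below assumes an ordering is stored as a tuple of marker labels and a regression model as a tuple of 0/1 inclusion flags; both representations and the function names are choices made for illustration.

```python
from itertools import combinations


def swap_neighbors(theta):
    """All orderings obtained by swapping the positions of two markers."""
    neighbors = []
    for i, j in combinations(range(len(theta)), 2):
        nb = list(theta)
        nb[i], nb[j] = nb[j], nb[i]
        neighbors.append(tuple(nb))
    return neighbors


def flip_neighbors(model):
    """All models obtained by adding or omitting exactly one predictor."""
    return [model[:i] + (1 - model[i],) + model[i + 1:]
            for i in range(len(model))]


orderings = swap_neighbors((3, 1, 4, 2))   # p(p - 1)/2 = 6 neighbors
models = flip_neighbors((1, 0, 1))         # p = 3 neighbors
```

Both neighborhoods are small relative to the search space: the swap neighborhood has $p(p-1)/2$ members out of $p!$ orderings, and the add-or-omit neighborhood has $p$ members out of $2^p$ models.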

A local neighborhood will usually contain several candidate solutions. An ob- 
vious strategy at each iteration is to choose the best among all candidates in the 
current neighborhood. This is the method of steepest ascent. To speed performance, 
one might instead select the first randomly chosen neighbor for which the objective 
function exceeds its previous value; this is random ascent or next ascent. 

If k-neighborhoods are used for a steepest ascent algorithm, the solution is said
to be k-optimal. Alternatively, any local search algorithm that chooses $\theta^{(t+1)}$ uphill
from $\theta^{(t)}$ is an ascent algorithm, even if the ascent is not the steepest possible within
$\mathcal{N}(\theta^{(t)})$.
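
The difference between steepest ascent and random (next) ascent is easiest to see side by side. The following Python sketch is a minimal illustration, assuming candidate solutions are tuples of 0/1 flags with 1-change neighborhoods; the helper names and the toy objective are hypothetical.

```python
import random


def flip_neighbors(theta):
    """1-change neighborhood: flip exactly one 0/1 flag."""
    return [theta[:i] + (1 - theta[i],) + theta[i + 1:]
            for i in range(len(theta))]


def steepest_ascent_step(theta, f, neighbors):
    """Move to the best neighbor, provided it is strictly uphill."""
    best = max(neighbors(theta), key=f)
    return best if f(best) > f(theta) else theta


def next_ascent_step(theta, f, neighbors, rng=random):
    """Move to the first randomly ordered neighbor that is strictly uphill."""
    candidates = neighbors(theta)
    rng.shuffle(candidates)
    for cand in candidates:
        if f(cand) > f(theta):
            return cand
    return theta


def ascend(theta, f, neighbors, step, max_steps=1000):
    """Repeat `step` until no uphill move is found (a local maximum)."""
    for _ in range(max_steps):
        new = step(theta, f, neighbors)
        if new == theta:
            break
        theta = new
    return theta


# Toy objective: number of flags that agree with a hidden target.
target = (1, 0, 1, 1)


def matches(theta):
    return sum(a == b for a, b in zip(theta, target))


end_steepest = ascend((0, 0, 0, 0), matches, flip_neighbors, steepest_ascent_step)
end_next = ascend((0, 1, 0, 0), matches, flip_neighbors, next_ascent_step)
```

Steepest ascent evaluates the whole neighborhood at every step, whereas next ascent may stop after a single evaluation; on this single-peaked toy objective both reach the same endpoint.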

The sequential selection of steps that are optimal in small neighborhoods, dis- 
regarding the global problem, is reminiscent of a greedy algorithm. A chess player 
using a greedy algorithm might look for the best immediate move with total disre- 
gard to its future consequences: perhaps moving a knight to capture a pawn without 
recognizing that the knight will be captured on the opponent’s next move. Wise selec- 
tion of a new candidate solution from a neighborhood of the current candidate must 
balance the need for a narrow focus enabling quick moves against the need to find a 
globally competitive solution. To avoid entrapment in poor local maxima, it might be 
reasonable, every once in a while, to eschew some of the best neighbors of $\theta^{(t)}$ in
favor of a direction whose rewards are only later realized. For example, when $\theta^{(t)}$ is a
local maximum, the approach of steepest ascent/mildest descent [306] allows a move
to the least unfavorable $\theta^{(t+1)} \in \mathcal{N}(\theta^{(t)})$ (see Section 3.5). There are also a variety of
techniques in which a candidate neighbor is selected from $\mathcal{N}(\theta^{(t)})$ and a random
decision rule is used to decide whether to adopt it or retain $\theta^{(t)}$. These algorithms generate
Markov chains $\{\theta^{(t)}\}$ $(t = 0, 1, \ldots)$ that are closely related to simulated annealing
(Section 3.3) and the methods of Chapter 7.

Searching within the current neighborhood for a k-change steepest ascent move 
can be difficult when k is greater than 1 or 2 because the size of the neighborhood
increases rapidly with k. For larger k, it can be useful to break the k-change up into 
smaller parts, sequentially selecting the best candidate solutions in smaller neigh- 
borhoods. To promote search diversity, breaking a k-change step into several smaller 
sequential changes can be coupled with the strategy of allowing one or more of 
the smaller steps to be suboptimal (e.g., random). Such variable-depth local search 
approaches permit a potentially better step away from the current candidate solution, 
even though it will not likely be optimal within the k-neighborhood. 

Ascent algorithms frequently converge to local maxima that are not globally 
competitive. One approach to overcoming this problem is the technique of random 
starts local search. Here, a simple ascent algorithm is repeatedly run to termination 
from a large number of starting points. The starting points are chosen randomly. The 
simplest approach is to select starting points independently and uniformly at random 
over $\Theta$. More sophisticated approaches may employ some type of stratified sampling
where the strata are identified from some pilot runs in an effort to partition $\Theta$ into
regions of qualitatively different convergence behavior.

TABLE 3.1 Potential predictors of baseball players’ salaries.

 1. Batting average            10. Strikeouts (SOs)         19. Walks per SO
 2. On base pct. (OBP)         11. Stolen bases (SBs)       20. OBP / errors
 3. Runs scored                12. Errors                   21. Runs per error
 4. Hits                       13. Free agency (a)          22. Hits per error
 5. Doubles                    14. Arbitration (b)          23. HRs per error
 6. Triples                    15. Runs per SO              24. SOs × errors
 7. Home runs (HRs)            16. Hits per SO              25. SBs × OBP
 8. Runs batted in (RBIs)      17. HRs per SO               26. SBs × runs
 9. Walks                      18. RBIs per SO              27. SBs × hits

(a) Free agent, or eligible.
(b) Arbitration, or eligible.


It may seem unsatisfying to rely solely on random starts to avoid being fooled 
by a local maximum. In later sections we introduce methods that modify local search 
in ways that provide a reasonable chance of finding a globally competitive candidate 
solution—possibly the global maximum—on any single run. Of course, the strategy 
of using multiple random starts can be overlaid on any of these approaches to provide 
additional confidence in the best solution found. Indeed, we recommend that this
always be done when feasible.
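
A random starts local search can be sketched in a few lines of Python. The version below assumes 0/1 candidate solutions, 1-change neighborhoods, steepest ascent, and starting points drawn independently and uniformly at random; the toy objective, which has one global and one merely local maximum, and all names are illustrative.

```python
import random


def flip_neighbors(theta):
    """1-change neighborhood: flip exactly one 0/1 flag."""
    return [theta[:i] + (1 - theta[i],) + theta[i + 1:]
            for i in range(len(theta))]


def steepest_ascent(theta, f, max_steps):
    """Run steepest ascent until a local maximum or the step limit."""
    for _ in range(max_steps):
        best = max(flip_neighbors(theta), key=f)
        if f(best) <= f(theta):
            break
        theta = best
    return theta


def random_starts(f, p, n_starts, max_steps, seed=0):
    """Best endpoint over n_starts ascents from uniform random starting points."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        start = tuple(rng.randint(0, 1) for _ in range(p))
        end = steepest_ascent(start, f, max_steps)
        if best is None or f(end) > f(best):
            best = end
    return best


def two_peaks(theta):
    """Toy objective on 6 flags: global maximum 9 at all zeros,
    uncompetitive local maximum 6 at all ones."""
    k = sum(theta)
    return max(k, 1.5 * (len(theta) - k))


trapped = steepest_ascent((1, 1, 1, 1, 1, 0), two_peaks, max_steps=14)
best = random_starts(two_peaks, p=6, n_starts=20, max_steps=14)
```

A single ascent started near the all-ones corner is trapped at the local maximum; repeating the ascent from many random starts makes it likely, though not certain, that at least one run begins in the basin of the global maximum.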


Example 3.3 (Baseball Salaries) Random starts local search can be very effective 
in practice because it is simple to code and fast to execute, allowing time for a large 
number of random starts. Here, we consider its application to a regression model 
selection problem. 

Table 3.1 lists 27 baseball performance statistics, such as batting percentages 
and numbers of home runs, which were collected for 337 players (no pitchers) in 1991. 
Players’ 1992 salaries, in thousands of dollars, may be related to these variables 
computed from the previous season. These data, derived from the data in [654], 
may be downloaded from the website for this book. We use the log of the salary 
variable as the response variable. The goal is to find the best subset of predictors 
to predict log salary using a linear regression model. Assuming that the intercept 
will be included in any model, there are $2^{27} = 134{,}217{,}728$ possible models in the
search space. 

Figure 3.3 illustrates the application of a random starts local search algorithm 
to minimize the AIC with respect to regression variable selection. The problem can be 
posed as maximizing the negative of the AIC, thus preserving our preference for uphill 
search. Neighborhoods were limited to 1-changes generated from the current model by 
either adding or deleting one predictor. Search was started from 5 randomly selected 
subsets of predictors (i.e., five starting points), and 14 additional steps were allocated 
to each start. Each move was made by steepest ascent. Since each steepest ascent step 
requires searching 27 neighbors, this small example requires 1890 evaluations of the 
objective function. A comparable limit to objective function evaluations was imposed 
on examples of other heuristic techniques that follow in the remainder of this chapter.

FIGURE 3.3 Results of random starts local search by steepest ascent for Example 3.3, for
15 iterations from each of five random starts. Only AIC values between −360 and −420 are
shown.


Figure 3.3 shows the value of the AIC for the best model at each step. Table 3.2
summarizes the results of the search. The second and third random starts (labeled
LS (2,3)) led to an optimal AIC of −418.95, derived from the model using predictors
2, 3, 6, 8, 10, 13, 14, 15, 16, 24, 25, and 26. The worst random start was the first,
which led to an AIC of −413.04 for a model with 10 predictors. For the sake of
comparison, a greedy stepwise method (the step() procedure in S-Plus [642])
chose a model with 12 predictors, yielding an AIC of −418.94. The greedy stepwise
method of Efroymson [465] chose a model with 9 predictors, yielding an AIC of
−402.16; however, this method is designed to find a good parsimonious model using
a criterion that differs slightly from the AIC. With default settings, neither of these
off-the-shelf algorithms found a model quite as good as the one found with a simple
random starts local search.


3.3 SIMULATED ANNEALING 


Simulated annealing is a popular technique for combinatorial optimization because 
it is generic and easily implemented in its simplest form. Also, its limiting behavior 
is well studied. On the other hand, this limiting behavior is not easily realized in 
practice, the speed of convergence can be maddeningly slow, and complex esoteric 
tinkering may be needed to substantially improve the performance. Useful reviews of 
simulated annealing include [75, 641]. 

TABLE 3.2 Results of random starts local search model selection for Example 3.3. The bullets
indicate inclusion of the corresponding predictor in each model selected, with model labels
explained in the text. In addition, all models in this table included predictors 3, 8, 13 and 14.

[The inclusion bullets for predictor columns 1, 2, 6, 7, 9, 10, 12, 15, 16, 18, 19, 20, 21, 22, 24, 25, 26, and 27 are not legible; only the method labels and AIC values are listed.]

Method       AIC
LS (2,3)     −418.95
S-Plus       −418.94
LS (5)       −416.15
LS (4)       −415.52
LS (1)       −413.04
Efroy.       −402.16


Annealing is the process of heating up a solid and then cooling it slowly. When a
stressed solid is heated, its internal energy increases and its molecules move randomly.
If the solid is then cooled slowly, the thermal energy generally decreases slowly, but
there are also random increases governed by Boltzmann’s probability. Namely, at
temperature $\tau$, the probability density of an increase in energy of magnitude $\Delta E$ is
$\exp\{-\Delta E / k\tau\}$ where $k$ is Boltzmann’s constant. If the cooling is slow enough and
deep enough, the final state is unstressed, where all the molecules are arranged to
have minimal potential energy.

For consistency with the motivating physical process, we pose optimization as
minimization in this section, so the minimum of $f(\theta)$ is sought over $\theta \in \Theta$. Then it is
possible to draw an analogy between the physical cooling process and the process of
solving a combinatorial minimization problem [130, 378]. For simulated annealing
algorithms, $\theta$ corresponds to the state of the material, $f(\theta)$ corresponds to its energy
level, and the optimal solution corresponds to the $\theta$ that has minimum energy. Random
changes to the current state (i.e., moves from $\theta^{(t)}$ to $\theta^{(t+1)}$) are governed by the
Boltzmann distribution given above, which depends on a parameter called temperature.
When the temperature is high, uphill moves (i.e., moves to a higher
energy state) are more likely to be tolerated. This discourages convergence to the
first local minimum that happens to be found, which might be premature if the space 
of candidate solutions has not yet been adequately explored. As search continues, 
the temperature is lowered. This forces increasingly concentrated search effort near 
the current local minimum, because few uphill moves will be allowed. If the cooling 
schedule is determined appropriately, the algorithm will hopefully converge to the 
global minimum. 
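
The effect of temperature on the tolerance for uphill moves can be seen numerically. The Python sketch below assumes the Boltzmann form with the constant absorbed into the temperature, so that an uphill move of size $\Delta = f(\theta^*) - f(\theta^{(t)}) > 0$ is accepted with probability $\exp\{-\Delta/\tau\}$ and downhill moves are always accepted; the function name is illustrative.

```python
from math import exp


def acceptance_probability(f_current, f_candidate, tau):
    """Probability of moving from the current to the candidate solution
    at temperature tau > 0, for a minimization problem."""
    if f_candidate <= f_current:
        return 1.0                       # downhill or level: always accept
    return exp((f_current - f_candidate) / tau)


# The same uphill move (energy increase of 1) at three temperatures.
probs = {tau: acceptance_probability(4.0, 5.0, tau) for tau in (10.0, 1.0, 0.1)}
for tau, prob in probs.items():
    print(f"tau={tau}: accept with probability {prob:.5f}")
```

At $\tau = 10$ the move is accepted about 90% of the time, at $\tau = 1$ about 37% of the time, and at $\tau = 0.1$ almost never, which is the sense in which cooling concentrates the search.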

The simulated annealing algorithm is an iterative procedure started at time
$t = 0$ with an initial point $\theta^{(0)}$ and a temperature $\tau_0$. Iterations are indexed by $t$. The
algorithm is run in stages, which we index by $j = 0, 1, 2, \ldots$, and each stage consists
of several iterations. The length of the $j$th stage is $m_j$. Each iteration proceeds as
follows:


1. Select a candidate solution $\theta^*$ within the neighborhood of $\theta^{(t)}$, say $\mathcal{N}(\theta^{(t)})$,
according to a proposal density $g^{(t)}(\cdot \mid \theta^{(t)})$.
2. Randomly decide whether to adopt $\theta^*$ as the next candidate solution or to keep
another copy of the current solution. Specifically, let $\theta^{(t+1)} = \theta^*$ with probability
equal to $\min\left(1, \exp\{[f(\theta^{(t)}) - f(\theta^*)]/\tau_j\}\right)$. Otherwise, let $\theta^{(t+1)} = \theta^{(t)}$.
3. Repeat steps 1 and 2 a total of $m_j$ times.
4. Increment $j$. Update $\tau_j = \alpha(\tau_{j-1})$ and $m_j = \beta(m_{j-1})$. Go to step 1.
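The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function name `simulated_annealing`, its arguments, and the toy objective `toy_f` with its one-step proposal `toy_propose` are all assumptions introduced here for demonstration, and the run is stopped after a fixed number of stages.

```python
import math
import random


def simulated_annealing(f, theta0, propose, tau0, alpha, beta, m0, n_stages, rng=None):
    """Minimize f with the staged annealing loop of steps 1-4.

    propose(theta, rng) draws a candidate from the neighborhood of theta;
    alpha and beta update the temperature and the stage length.
    """
    if rng is None:
        rng = random.Random()
    theta, f_theta = theta0, f(theta0)
    best, f_best = theta, f_theta
    tau, m = tau0, m0
    for _ in range(n_stages):
        for _ in range(m):                       # step 3: m_j iterations per stage
            cand = propose(theta, rng)           # step 1: candidate from N(theta)
            f_cand = f(cand)
            # step 2: downhill moves are always accepted; uphill moves are
            # accepted with probability exp{[f(theta) - f(cand)] / tau}
            if f_cand <= f_theta or rng.random() < math.exp((f_theta - f_cand) / tau):
                theta, f_theta = cand, f_cand
                if f_theta < f_best:             # remember the best solution seen
                    best, f_best = theta, f_theta
        tau, m = alpha(tau), beta(m)             # step 4: cool, resize the stage
    return best, f_best


# Hypothetical toy problem: integers 0..20, neighbors are one step away.
def toy_f(x):
    return (x - 13) ** 2 + 10 * math.cos(2 * x)


def toy_propose(x, rng):
    return min(20, max(0, x + rng.choice((-1, 1))))


best, f_best = simulated_annealing(
    toy_f, 0, toy_propose, tau0=20.0,
    alpha=lambda tau: 0.9 * tau, beta=lambda m: m + 10,
    m0=50, n_stages=15, rng=random.Random(1),
)
print(best, f_best)
```

The sketch returns the best candidate visited rather than the final state, in line with the remark that the best candidate solution found is the estimated minimum.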


If the algorithm is not stopped according to a limit on the total number of iterations
or a predetermined schedule of $\tau_j$ and $m_j$, one can monitor an absolute or relative
convergence criterion (see Chapter 2). Often, however, the stopping rule is expressed
as a minimum temperature. After stopping, the best candidate solution found is the
estimated minimum.

The function $\alpha$ should slowly decrease the temperature to zero. The number of
iterations at each temperature ($m_j$) should be large and increasing in $j$. Ideally, the
function $\beta$ should scale the $m_j$ exponentially in $p$, but in practice some compromises
will be required in order to obtain tolerable computing speed.

Although the new candidate solution is always adopted when it is superior to the 
current solution, note that it has some probability of being adopted even when it is in- 
ferior. In this sense, simulated annealing is a stochastic descent algorithm. Its random- 
ness allows simulated annealing sometimes to escape uncompetitive local minima. 


3.3.1 Practical Issues 


3.3.1.1 Neighborhoods and Proposals Strategies for choosing neighbor- 
hoods can be very problem specific, but the best neighborhoods are usually small 
and easily computed. 

Consider the traveling salesman problem. Numbering the cities $1, 2, \ldots, p$, any
tour $\theta$ can be written as a permutation of these integers. The cities are linked in this
order, with an additional link between the final city visited and the original city where
the tour began. A neighbor of $\theta$ can be generated by removing two nonadjacent links
and reconnecting the tour. In this case, there is only one way to obtain a valid tour
through reconnection: One of the tour segments must be reversed. For example, the
tour 143256 is a neighbor of the tour 123456. Since two links are altered, the process
of generating such neighbors is a 2-change, and it yields a 2-neighborhood. Any tour
has $p(p - 3)/2$ unique 2-change neighbors distinct from $\theta$ itself. This neighborhood
is considerably smaller than the $(p - 1)!/2$ tours in the complete solution space.
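The 2-change and the neighbor count can be checked with a short Python sketch. The helper names (`two_change`, `canonical`, `two_change_neighbors`) are illustrative assumptions; representing a tour by its set of undirected links is simply a convenient device used here to recognize tours that differ only in starting city or direction.

```python
from itertools import combinations


def two_change(tour, i, j):
    """Reverse the segment tour[i..j] (inclusive), which replaces two links."""
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def canonical(tour):
    """Identify a closed, undirected tour with its set of links."""
    p = len(tour)
    return frozenset(frozenset((tour[k], tour[(k + 1) % p])) for k in range(p))


def two_change_neighbors(tour):
    """All distinct tours reachable from `tour` by a single 2-change."""
    base = canonical(tour)
    found = {}
    for i, j in combinations(range(len(tour)), 2):
        cand = two_change(tour, i, j)
        key = canonical(cand)
        if key != base:                 # skip reversals that reproduce the tour
            found.setdefault(key, cand)
    return list(found.values())


tour = (1, 2, 3, 4, 5, 6)
print(two_change(tour, 1, 3))           # the neighbor 143256 from the text
print(len(two_change_neighbors(tour)))  # compare with p(p - 3)/2 for p = 6
```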

It is critical that the chosen neighborhood structure allows all solutions in $\Theta$
to communicate. For $\theta_i$ and $\theta_j$ to communicate, it must be possible to find a finite
sequence of solutions $\theta_1, \ldots, \theta_k$ such that $\theta_1 \in \mathcal{N}(\theta_i)$, $\theta_2 \in \mathcal{N}(\theta_1)$, $\ldots$, $\theta_k \in \mathcal{N}(\theta_{k-1})$,
and $\theta_j \in \mathcal{N}(\theta_k)$. The 2-neighborhoods mentioned above for the traveling salesman
problem allow communication between any $\theta_i$ and $\theta_j$.

The most common proposal density, $g^{(t)}(\cdot \mid \theta^{(t)})$, is discrete uniform: a candidate
is sampled completely at random from $\mathcal{N}(\theta^{(t)})$. This has the advantage of speed
and simplicity. Other, more strategic methods have also been suggested [281, 282,
659].

Rapid updating of the objective function is an important strategy for speeding
simulated annealing runs. In the traveling salesman problem, sampling a 2-neighbor
at random amounts to selecting two integers from which is derived a permutation
of the current tour. Note also for the traveling salesman problem that $f(\theta^*)$ can
be efficiently calculated for any $\theta^*$ in the 2-neighborhood of $\theta^{(t)}$ when $f(\theta^{(t)})$ has
already been found. In this case, the new tour length equals the old tour length minus
the distance for traveling the two broken links, plus the distance for traveling the two
new links. The time to compute this does not depend on problem size p.
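The constant-time update described above can be sketched as follows. The sketch assumes a symmetric distance matrix (so that reversing a segment does not change the length of its interior links) and segment endpoints with 0 <= i < j <= p - 1; the city coordinates and function names are hypothetical.

```python
import math


def tour_length(tour, dist):
    """Length of the closed tour, computed from scratch in O(p) time."""
    p = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % p]] for k in range(p))


def two_change_delta(tour, dist, i, j):
    """Change in length when tour[i..j] is reversed, in O(1) time."""
    p = len(tour)
    if i == 0 and j == p - 1:
        return 0.0                        # reversing the whole tour changes nothing
    a, b = tour[i - 1], tour[i]           # first broken link (a, b)
    c, d = tour[j], tour[(j + 1) % p]     # second broken link (c, d)
    # new links are (a, c) and (b, d)
    return dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]


# Hypothetical city coordinates, indexed 0..4.
cities = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 2)]
dist = [[math.dist(u, v) for v in cities] for u in cities]
tour = [0, 1, 2, 3, 4]
i, j = 1, 3
new_tour = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
print(tour_length(tour, dist) + two_change_delta(tour, dist, i, j))
print(tour_length(new_tour, dist))        # should agree up to rounding error
```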


3.3.1.2 Cooling Schedule and Convergence The sequence of stage lengths 
and temperatures is called the cooling schedule. Ideally, the cooling schedule should 
be slow. 

The limiting behavior of simulated annealing follows from Markov chain the- 
ory, briefly reviewed in Chapter 1. Simulated annealing can be viewed as producing 
a sequence of homogeneous Markov chains (one at each temperature) or a single 
inhomogeneous Markov chain (with temperature decreasing between transitions). 
Although these views lead to different approaches to defining limiting behavior, both 
lead to the conclusion that the limiting distribution of draws has support only on the 
set of global minima. 

To understand why cooling should lead to the desired convergence of the
algorithm at a global minimum, first consider the temperature to be fixed at $\tau$.
Suppose further that proposing $\theta_i$ from $\mathcal{N}(\theta_j)$ has the same probability as proposing $\theta_j$
from $\mathcal{N}(\theta_i)$ for any pair of solutions $\theta_i$ and $\theta_j$ in $\Theta$. In this case, the sequence of
$\theta^{(t)}$ generated by simulated annealing is a Markov chain with stationary distribution
$\pi_\tau(\theta) \propto \exp\{-f(\theta)/\tau\}$. This means that $\lim_{t \to \infty} P[\theta^{(t)} = \theta] = \pi_\tau(\theta)$. This approach
to generating a sequence of random values is called the Metropolis algorithm and is
discussed in Section 7.1.

In principle, we would like to run the chain at this fixed temperature long 
enough that the Markov chain is approximately in its stationary distribution before 
the temperature is reduced. 

Suppose there are $M$ global minima and the set of these solutions is $\mathcal{M}$. Denote
the minimal value of $f$ on $\Theta$ as $f_{\min}$. Then the stationary distribution of the chain for
a fixed $\tau$ is given by

$$\pi_\tau(\theta_i) = \frac{\exp\{-[f(\theta_i) - f_{\min}]/\tau\}}{M + \sum_{j \notin \mathcal{M}} \exp\{-[f(\theta_j) - f_{\min}]/\tau\}} \tag{3.5}$$

for each $\theta_i \in \Theta$.
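Now, as $\tau \to 0$ from above, the limit of $\exp\{-[f(\theta_i) - f_{\min}]/\tau\}$ is 0 if $i \notin \mathcal{M}$
and 1 if $i \in \mathcal{M}$. Thus

$$\lim_{\tau \downarrow 0} \pi_\tau(\theta_i) = \begin{cases} 1/M & \text{if } i \in \mathcal{M}, \\ 0 & \text{otherwise.} \end{cases} \tag{3.6}$$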


The mathematics to make these arguments precise can be found in [67, 641]. 
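The limiting behavior in (3.6) is easy to observe numerically. The sketch below evaluates the stationary probabilities of (3.5) on a small, entirely hypothetical set of objective values with two global minima; the function name `boltzmann_distribution` is an assumption made for illustration.

```python
import math


def boltzmann_distribution(f_values, tau):
    """Stationary probabilities proportional to exp{-f(theta_i)/tau}.

    Subtracting f_min before exponentiating matches the form of (3.5)
    and avoids overflow at low temperatures.
    """
    f_min = min(f_values)
    weights = [math.exp(-(f - f_min) / tau) for f in f_values]
    total = sum(weights)
    return [w / total for w in weights]


f_values = [2.0, 0.0, 1.0, 0.0, 3.0]    # hypothetical values; M = 2 global minima
for tau in (10.0, 1.0, 0.1, 0.01):
    probs = boltzmann_distribution(f_values, tau)
    print(tau, [round(p, 4) for p in probs])
```

As the temperature falls, the probability mass concentrates on the two minimizers, each approaching 1/M = 1/2.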

It is also possible to relate the cooling schedule to a bound on the quality of
the final solution. If one wishes any iterate to have not more than probability $\delta$ in
equilibrium of being worse than the global minimum by more than $\epsilon$, this can be
achieved if one cools until $\tau_j < \epsilon/\log\{(N - 1)/\delta\}$, where $N$ is the number of points
in $\Theta$ [426]. In other words, this $\tau_j$ ensures that the final Markov chain configuration
will in equilibrium have $P[f(\theta) > f_{\min} + \epsilon] < \delta$.

If neighborhoods communicate and the depth of the deepest local (and
nonglobal) minimum is $c$, then the cooling schedule given by $\tau = c/\log\{1 + i\}$
guarantees asymptotic convergence, where $i$ indexes iterations [292]. The depth of a local
minimum is defined to be the smallest increase in the objective function needed to
escape from that local minimum into the valley of any other minimum. However,
mathematical bounds on the number of iterations required to achieve a high
probability of having discovered at least one element of $\mathcal{M}$ often exceed the size of $\Theta$
itself. In this case, one cannot establish that simulated annealing will find the global
minimum more quickly than an exhaustive search [33].
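The temperature bound $\epsilon/\log\{(N - 1)/\delta\}$ quoted above can be made plausible with a short calculation. Under (3.5), each solution worse than $f_{\min} + \epsilon$ has weight below $\exp\{-\epsilon/\tau\}$ relative to a global minimum, and there are at most $N - 1$ such solutions, so the tail probability is below $(N - 1)\exp\{-\epsilon/\tau\}$. The sketch below evaluates this worst case; the function names and the choice of a 10-city traveling salesman problem are assumptions for illustration only.

```python
import math


def final_temperature(eps, delta, n_points):
    """Temperature eps / log((N - 1) / delta) from the bound in the text."""
    return eps / math.log((n_points - 1) / delta)


def worst_case_tail(eps, tau, n_points):
    """Largest possible P[f > f_min + eps] under (3.5): one global minimum and
    N - 1 other points whose objective values barely exceed f_min + eps."""
    w = (n_points - 1) * math.exp(-eps / tau)
    return w / (1.0 + w)


N = math.factorial(9) // 2        # number of tours for p = 10 cities, (p - 1)!/2
tau = final_temperature(eps=1.0, delta=0.05, n_points=N)
print(tau, worst_case_tail(1.0, tau, N))
```

Because the required temperature shrinks only logarithmically in N, the bound stays usable even for very large solution spaces, although it says nothing about how long the chain needs to reach equilibrium.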

If one wishes the Markov chain generated by simulated annealing to be approx- 
imately in its stationary distribution at each temperature before reducing temperature, 
then the length of the run ideally should be at least quadratic in the size of the solution 
space [1], which itself is usually exponential in problem size. Clearly, much shorter 
stage lengths must be chosen if simulated annealing is to require fewer iterations than 
exhaustive search. 

In practice, many cooling schedules have been tried [641]. Recall that the
temperature at stage $j$ is $\tau_j = \alpha(\tau_{j-1})$ and the number of iterations in stage $j$ is
$m_j = \beta(m_{j-1})$. One popular approach is to set $m_j = 1$ for all $j$ and reduce the
temperature very slowly according to $\alpha(\tau_{j-1}) = \tau_{j-1}/(1 + a\tau_{j-1})$ for a small value of $a$.
A second option is to set $\alpha(\tau_{j-1}) = a\tau_{j-1}$ for $a < 1$ (usually $a > 0.9$). In this case,
one might increase stage lengths as temperatures decrease. For example, consider
$\beta(m_{j-1}) = bm_{j-1}$ for $b > 1$, or $\beta(m_{j-1}) = b + m_{j-1}$ for $b > 0$. A third schedule uses

$$\alpha(\tau_{j-1}) = \frac{\tau_{j-1}}{1 + \tau_{j-1} \log\{1 + r\}/(3 s_{\tau_{j-1}})},$$

where $s^2_{\tau_{j-1}}$ is the mean squared objective function cost at the current temperature
minus the square of the mean cost at the current temperature (i.e., the variance of
the cost), and $r$ is some small real
number [1]. Using the temperature schedule $\tau = c/\log\{1 + i\}$ mentioned above
based on theory is rarely practical because it is too slow and the determination of $c$
is difficult, with excessively large guesses for $c$ further slowing the algorithm.
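The schedules just described translate directly into small update functions. The following Python sketch is illustrative only: the function names and the default values of a, b, and r are assumptions, and the third schedule takes s to be the standard deviation of the costs observed at the current temperature.

```python
import math


def alpha_slow(tau, a=0.001):
    """First option (used with m_j = 1): tau / (1 + a * tau) for small a."""
    return tau / (1.0 + a * tau)


def alpha_geometric(tau, a=0.95):
    """Second option: a * tau with a < 1."""
    return a * tau


def alpha_adaptive(tau, costs, r=0.1):
    """Third option: tau / (1 + tau * log(1 + r) / (3 * s)), where s is the
    standard deviation of the costs seen at the current temperature."""
    n = len(costs)
    mean = sum(costs) / n
    variance = sum(c * c for c in costs) / n - mean ** 2
    if variance <= 0.0:
        raise ValueError("costs must vary for the adaptive schedule")
    return tau / (1.0 + tau * math.log(1.0 + r) / (3.0 * math.sqrt(variance)))


def beta_multiplicative(m, b=1.5):
    """Stage lengths grow geometrically: b * m for b > 1 (rounded up)."""
    return math.ceil(b * m)


def beta_additive(m, b=10):
    """Stage lengths grow linearly: b + m for b > 0."""
    return m + b


def tau_logarithmic(i, c):
    """Theoretical schedule tau = c / log(1 + i) for iteration i >= 1."""
    return c / math.log(1.0 + i)
```

For the first option, iterating the update gives $1/\tau_j = 1/\tau_{j-1} + a$, so $\tau_j = \tau_0/(1 + j a \tau_0)$, which makes clear how slowly the temperature falls when $a$ is small.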

Most practitioners require lengthy experimentation to find suitable initial
parameter values (e.g., $\tau_0$ and $m_0$) and values of the proposed schedules (e.g., $a$,
$b$, and $r$). While selection of the initial temperature $\tau_0$ is usually problem dependent,
some general guidelines may be given. A useful strategy is to choose a positive $\tau_0$
value so that $\exp\{[f(\theta_i) - f(\theta_j)]/\tau_0\}$ is close to 1 for any pair of solutions $\theta_i$ and
$\theta_j$ in $\Theta$. The rationale for this choice is that it provides any point in the parameter
space with a reasonable chance of being visited in early iterations of the algorithm.
Similarly, choosing $m_j$ to be large can produce a more accurate solution, but can result
in long computing times. As a general rule of thumb, larger decreases in temperature
require longer runs after the decrease. Finally, a good deal of evidence suggests that
running simulated annealing long at high temperatures is not very useful. In many
problems, the barriers between local minima are sufficiently modest that jumps
between them are possible even at fairly low temperatures. Good cooling schedules
therefore decrease the temperature rapidly at first.


FIGURE 3.4 Results of two simulated annealing minimizations of the regression model AIC
for Example 3.4. The temperature for the bottom curve is shown by the dotted line and the
right axis. Only AIC values between -360 and -420 are shown.


Example 3.4 (Baseball Salaries, Continued) To implement simulated annealing
for variable selection via the AIC in the baseball salary regression problem introduced
in Example 3.3, we must establish a neighborhood structure, a proposal distribution,
and a temperature schedule. The simplest neighborhoods contain 1-change neighbors
generated from the current model by either adding or deleting one predictor. We
assigned equal probabilities to all candidates in a neighborhood. The cooling schedule
had 15 stages, with stage lengths of 60 for the first 5 stages, 120 for the next 5, and
220 for the final 5. Temperatures were decreased according to $\alpha(\tau_{j-1}) = 0.9\tau_{j-1}$
after each stage.

Figure 3.4 shows the values of the AIC for the sequence of candidate solutions
generated by simulated annealing, for two different choices of $\tau_0$. The bottom curve
corresponds to $\tau_0 = 1$. In this case, simulated annealing became stuck at particular
candidate solutions for distinct periods because the low temperatures allowed little
tolerance for uphill moves. In the particular realization shown, the algorithm quickly
found good candidate solutions with low AIC values, where it became stuck
frequently. However, in other cases (e.g., with a very multimodal objective function),
such stickiness may result in the algorithm becoming trapped in a region far from
the global minimum. A second run with $\tau_0 = 6$ (top solid line) yielded considerable
mixing, with many uphill proposals accepted as moves. The temperature schedule
for $\tau_0 = 1$ is shown by the dotted line and the right axis. Both runs exhibited greater
mixing at higher temperatures. When $\tau_0 = 1$, the best model found was first identified
in the 1274th step and dominated the simulation after that point. This model achieved
an AIC of -418.95, and matched the best model found using random starts local
search in Table 3.2. When $\tau_0 = 6$, the best model found had an AIC of -417.85.
This run was clearly unsuccessful, requiring more iterations, cooler temperatures,
or both.
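The cooling schedule used in Example 3.4 can be written out explicitly. The sketch below assumes that the first stage is run at the initial temperature and that the temperature is multiplied by 0.9 after each stage; the function name `example_schedule` is hypothetical.

```python
def example_schedule(tau0, a=0.9):
    """(temperature, stage length) pairs for the 15 stages of Example 3.4."""
    stage_lengths = [60] * 5 + [120] * 5 + [220] * 5
    return [(tau0 * a ** j, m) for j, m in enumerate(stage_lengths)]


schedule = example_schedule(tau0=1.0)
total_iterations = sum(m for _, m in schedule)
final_tau = schedule[-1][0]
print(total_iterations, round(final_tau, 4))
```

Under these assumptions each run comprises 2000 iterations, and the final stage is run at roughly 23% of the initial temperature.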


3.3.2 Enhancements 


There are many variations on simulated annealing that purport to improve perfor- 
mance. Here we list a few ideas in an order roughly corresponding to the steps in the 
basic algorithm. 

The simplest way to start simulated annealing is to start once, anywhere. A 
strategy employing multiple random starts would have the dual advantages of poten- 
tially finding a better candidate solution and allowing confirmation of convergence to 
the particular optimum found. Purely random starts could be replaced by a stratified 
set of starting points chosen by strategic preprocessing to be more likely to lead to 
minima than simple random starts. Such strategies must have high payoffs if they are 
to be useful, given simulated annealing’s generally slow convergence. In some cases, 
the extra iterations dedicated to various random starts may be better spent on a single 
long run with longer stage sizes and a slower cooling schedule. 

The solution space, $\Theta$, may include constraints on $\theta$. For example, in the genetic
mapping problem introduced in Example 3.1, $\theta$ must be a permutation of the
integers $1, \ldots, p$ when there are $p$ markers. When the process for generating neighbors
creates solutions that violate these constraints, substantial time may be wasted fixing
candidates or repeatedly sampling from $\mathcal{N}(\theta^{(t)})$ until a valid candidate is found. An
alternative is to relax the constraints and introduce a penalty into $f$ that penalizes
invalid solutions. In this manner, the algorithm can be discouraged from visiting invalid
solutions without dedicating much time to enforcing constraints.

In the basic algorithm, the neighborhood definition is static and the proposal 
distribution is the same at each iteration. Sometimes improvements can be obtained 
by adaptively restricting neighborhoods at each iteration. For example, it can be use- 
ful to shrink the size of the neighborhood as time increases to avoid many wasteful 
generations of distant candidates that are very likely to be rejected at such low tem- 
peratures. In other cases, when a penalty function is used in place of constraints, it 
may be useful to allow only neighborhoods composed of solutions that reduce or 
eliminate constraint violations embodied in the current $\theta^{(t)}$.

It is handy if f can be evaluated quickly for new candidates. We noted pre- 
viously that neighborhood definitions can sometimes enable this, as in the traveling 
salesman problem where a 2-neighborhood strategy led to a simple updating for- 
mula for f. Simple approximation of f is sometimes made, often in a problem- 
specific manner. At least one author suggests monitoring recent iterates and in- 
troducing a penalty term in f that discourages revisiting states like those recently 
visited [201]. 

Next consider the acceptance probability given in step 2 of the canonical
simulated annealing algorithm in Section 3.3. The expression $\exp\{[f(\theta^{(t)}) - f(\theta^*)]/\tau_j\}$
is motivated by the Boltzmann distribution from statistical thermodynamics. Other
acceptance probabilities can be used, however. The linear Taylor series expansion
of the Boltzmann distribution motivates $\min\{1, 1 + [f(\theta^{(t)}) - f(\theta^*)]/\tau_j\}$ as a
possible acceptance probability [352]. To encourage moderate moves away from
local minima while preventing excessive small moves, the acceptance probability
$\min\{1, \exp\{[c + f(\theta^{(t)}) - f(\theta^*)]/\tau_j\}\}$, where $c > 0$, has been suggested for
certain problems [169].
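The three acceptance rules can be compared side by side in code. In this sketch the function names are assumptions, and the clamp at zero in `accept_linear` is an addition made here so that the returned value is always a valid probability; the linear expression itself becomes negative when the uphill step exceeds the temperature.

```python
import math


def accept_boltzmann(f_cur, f_cand, tau):
    """min{1, exp([f_cur - f_cand] / tau)}: the canonical rule of step 2."""
    if f_cand <= f_cur:
        return 1.0                       # also avoids overflow in exp
    return math.exp((f_cur - f_cand) / tau)


def accept_linear(f_cur, f_cand, tau):
    """min{1, 1 + [f_cur - f_cand] / tau}, clamped at 0 (clamp is an assumption)."""
    return max(0.0, min(1.0, 1.0 + (f_cur - f_cand) / tau))


def accept_shifted(f_cur, f_cand, tau, c):
    """min{1, exp([c + f_cur - f_cand] / tau)} with c > 0: uphill moves
    smaller than c are always accepted."""
    gap = c + f_cur - f_cand
    if gap >= 0.0:
        return 1.0
    return math.exp(gap / tau)


for rise in (0.5, 1.0, 2.0):             # size of the uphill step at tau = 1
    print(rise,
          round(accept_boltzmann(0.0, rise, 1.0), 4),
          round(accept_linear(0.0, rise, 1.0), 4),
          round(accept_shifted(0.0, rise, 1.0, c=1.0), 4))
```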

In general, there is little evidence that the shape of the cooling schedule (lin- 
ear, polynomial, exponential) matters much, as long as the useful range of temper- 
atures is covered, the range is traversed at roughly the same rate, and sufficient 
time is spent at each temperature (especially the low temperatures) [169]. Reheat- 
ing strategies that allow sporadic, systematic, or interactive temperature increases 
to prevent getting stuck in a local minimum at low temperatures can be effective 
[169, 256, 378]. 

After simulated annealing is complete, one might take the final result of one 
or more runs and polish these with a descent algorithm. In fact, one could refine 
occasional accepted steps in the same way, instead of waiting until simulated annealing 
has terminated.


3.4 GENETIC ALGORITHMS


Annealing is not the only natural process successfully exploited as a metaphor to solve
optimization problems. Genetic algorithms mimic the process of Darwinian natural 
selection. Candidate solutions to a maximization problem are envisioned as biological 
organisms represented by their genetic code. The fitness of an organism is analogous to 
the quality of a candidate solution. Breeding among highly fit organisms provides the 
best opportunity to pass along desirable attributes to future generations, while breeding 
among less fit organisms (and rare genetic mutations) ensures population diversity. 
Over time, the organisms in the population should evolve to become increasingly fit, 
thereby providing a set of increasingly good candidate solutions to the optimization 
problem. The pioneering development of genetic algorithms was done by Holland 
[333]. Other useful references include [17, 138, 200, 262, 464, 531, 533, 661]. 

We revert now to our standard description of optimization as maximization,
where we seek the maximum of $f(\theta)$ with respect to $\theta \in \Theta$. In statistical applications
of genetic algorithms, $f$ is often a joint log profile likelihood function.


3.4.1 Definitions and the Canonical Algorithm 


3.4.1.1 Basic Definitions In Example 3.1 above, some genetics terminology 
was introduced. Here we discuss additional terminology needed to study genetic 
algorithms. 

In a genetic algorithm, every candidate solution corresponds to an individual, or 
organism, and every organism is completely described by its genetic code. Individuals 
are assumed to have one chromosome. A chromosome is a sequence of C symbols, 
each of which consists of a single choice from a predetermined alphabet. The most 
basic alphabet is the binary alphabet, {0, 1}, in which case a chromosome of length 
C = 9 might look like 100110001. The C elements of the chromosome are the genes. 
The values that might be stored in a gene (i.e., the elements of the alphabet) are alleles. 
The position of a gene in the chromosome is its locus.

The information encoded in an individual’s chromosome is its genotype. We will
represent a chromosome or its genotype as $\vartheta$. The expression of the genotype in the
organism itself is its phenotype. For optimization problems, phenotypes are candidate
solutions and genotypes are encodings: Each genotype, $\vartheta$, encodes a phenotype, $\theta$,
using the chosen allele alphabet.

Genetic algorithms are iterative, with iterations indexed by t. Unlike the methods
previously discussed in this chapter, genetic algorithms track more than one candidate
solution simultaneously. Let the tth generation consist of a collection of P organisms,
$\vartheta_1^{(t)}, \ldots, \vartheta_P^{(t)}$. This population of size P at generation t corresponds to a collection of
candidate solutions, $\theta_1^{(t)}, \ldots, \theta_P^{(t)}$.

Darwinian natural selection favors organisms with high fitness. The fitness of 
an organism $\vartheta_i^{(t)}$ depends on the corresponding $f(\theta_i^{(t)})$. A high-quality candidate
solution has a high value of the objective function and a high fitness. As generations 
progress, organisms inherit from their parents bits of genetic code that are associated 
with high fitness if fit parents are predominantly selected for breeding. An offspring 
is a new organism inserted in the (t + 1)th generation to replace a member of the 
tth generation; the offspring’s chromosome is determined from those of two parent 
chromosomes belonging to the tth generation. 

To illustrate some of these ideas, consider a regression model selection problem
with 9 predictors. Assume that an intercept will be included in any model. The
genotype of any model can then be written as a chromosome of length 9. For example, the
chromosome $\vartheta_i^{(t)} = 100110001$ is a genotype corresponding to the phenotype of a model
containing only the fitted parameters for the intercept and predictors 1, 4, 5, and 9.

Another genotype is $\vartheta_k^{(t)} = 110100110$. Notice that $\vartheta_i^{(t)}$ and $\vartheta_k^{(t)}$ share some
common genes. A schema is any subcollection of genes. In this example, the two
chromosomes share the schema 1*01*****, where * represents a wildcard: The
allele in that locus is ignored. (These two chromosomes also share the schemata
**01*****, 1*01*0***, and others.) The significance of schemata is that they encode
modest bits of genetic information that may be transferred as a unit from parent to
offspring. If a schema is associated with a phenotypic feature that induces high values
of the objective function, then the inheritance of this schema by individuals in future
generations promotes optimization.
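Schemata are simple to manipulate when chromosomes are stored as strings. The sketch below, with hypothetical helper names `shared_schema` and `matches`, computes the most specific schema common to two chromosomes and tests whether a chromosome is an instance of a given schema.

```python
def shared_schema(a, b, wildcard="*"):
    """Most specific schema shared by two equal-length chromosomes."""
    return "".join(x if x == y else wildcard for x, y in zip(a, b))


def matches(chromosome, schema, wildcard="*"):
    """True if the chromosome agrees with the schema at every fixed locus."""
    return len(chromosome) == len(schema) and all(
        s == wildcard or s == g for g, s in zip(chromosome, schema)
    )


first, second = "100110001", "110100110"
print(shared_schema(first, second))
print(matches(first, "1*01*****"), matches(second, "1*01*****"))
```

Every schema shared by the two chromosomes can be obtained from the output of `shared_schema` by replacing some of its fixed alleles with wildcards.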


3.4.1.2 Selection Mechanisms and Genetic Operators Breeding drives 
most genetic change. The process by which parents are chosen to produce offspring 
is called the selection mechanism. One simple approach is to select one parent with 
probability proportional to fitness and to select the other parent completely at random. 
Another approach is to select each parent independently with probability propor- 
tional to fitness. Section 3.4.2.2 describes some of the most frequently used selection 
mechanisms. 

After two parents from the tth generation have been selected for breeding, their 
chromosomes are combined in some way that allows schemata from each parent to be 
inherited by their offspring, who become members of generation t + 1. The methods 
for producing offspring chromosomes from chosen parent chromosomes are genetic 
operators.


FIGURE 3.5 An example of generation production in a genetic algorithm for a population of
size P = 4 with chromosomes of length C = 3. Crossovers are illustrated by boxing portions
of some chromosomes. Mutation is indicated by an underlined gene in the final column.


A fundamental genetic operator is crossover. One of the simplest crossover
methods is to select a random position between two adjacent loci and split both parent
chromosomes at this position. Glue the left chromosome segment from one parent to
the right segment from the other parent to form an offspring chromosome. The
remaining segments can be combined to form a second offspring or discarded. For example,
suppose the two parents are 100110001 and 110100110. If the random split point
is between the third and fourth loci, then the potential offspring are 100100110 and
110110001. Note that in this example, both offspring inherit the schema 1*01*****.
Crossover is the key to a genetic algorithm: it allows good features of two candidate
solutions to be combined. Some more complicated crossover operators are discussed
in Section 3.4.2.3.
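Simple crossover amounts to a pair of string slices. In the sketch below, the function name and the convention that `split` counts the genes taken from the left segment are assumptions; when no split point is supplied, one is drawn uniformly from the C - 1 positions between adjacent loci.

```python
import random


def single_point_crossover(parent1, parent2, split=None, rng=None):
    """Return both offspring formed by exchanging segments at one split point."""
    if rng is None:
        rng = random.Random()
    if split is None:
        split = rng.randint(1, len(parent1) - 1)   # between loci split, split + 1
    child1 = parent1[:split] + parent2[split:]
    child2 = parent2[:split] + parent1[split:]
    return child1, child2


# Split between the third and fourth loci, as in the example above.
print(single_point_crossover("100110001", "110100110", split=3))
```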

Mutation is another important genetic operator. Mutation changes an offspring
chromosome by randomly introducing one or more alleles in loci where those alleles
are not seen in the corresponding loci of either parent chromosome. For example, if
crossover produced 100100110 from the parents mentioned above, subsequent
mutation might yield 101100110. Note that the third gene was 0 in both parents and
therefore crossover alone was guaranteed to retain the schema **0******.
Mutation, however, provides a way to escape this constraint, thereby promoting search
diversification and providing a way to escape from local maxima.

Mutation is usually applied after breeding. In the simplest implementation, each
gene has an independent probability, $\mu$, of mutating, and the new allele is chosen
completely at random from the genetic alphabet. If $\mu$ is too low, many potentially
good innovations will be missed; if $\mu$ is too high, the algorithm’s ability to learn over
time will be degraded, because excessive random variation will disturb the fitness
selectivity of parents and the inheritance of desirable schemata.
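The simplest mutation scheme described above can be sketched in a few lines. The function name `mutate` and its arguments are illustrative assumptions. Because the replacement allele is drawn uniformly from the whole alphabet, it may coincide with the old allele, so with a binary alphabet a gene actually changes with probability of only about half the nominal rate.

```python
import random


def mutate(chromosome, mu, alphabet="01", rng=None):
    """Independently, with probability mu, replace each gene by an allele
    drawn uniformly at random from the alphabet."""
    if rng is None:
        rng = random.Random()
    return "".join(
        rng.choice(alphabet) if rng.random() < mu else gene
        for gene in chromosome
    )


print(mutate("100100110", mu=0.1, rng=random.Random(3)))
```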

To summarize, genetic algorithms proceed by producing generations of
individuals. The (t + 1)th generation is produced as follows. First the individuals in
generation t are ranked and selected according to fitness. Then crossover and
mutation are applied to these selected individuals to produce generation t + 1. Figure 3.5
is a small example of the production of a generation of four individuals with three
genes per chromosome and binary chromosome encoding. In generation t,
individual 110 has the highest fitness among its generation and is chosen twice in the
selection stage. In the crossover stage, the selected individuals are paired off so that
each pair recombines to generate two new individuals. In the mutation stage, a low
mutation rate is applied. In this example, only one mutation occurs. The completion
of these steps yields the new generation.


FIGURE 3.6 Results of a genetic algorithm for Example 3.5. The horizontal axis is
generation.


Example 3.5 (Baseball Salaries, Continued) The results of applying a simple 
genetic algorithm to the variable selection problem for the baseball data introduced 
in Example 3.3 are shown in Figure 3.6. One hundred generations of size P = 20 
were used. Binary inclusion-exclusion alleles were used for each possible predictor, 
yielding chromosomes of length C = 27. The starting generation consisted of purely 
random individuals. A rank-based fitness function was used; see Equation (3.9). One 
parent was selected with probability proportional to this fitness; the other parent 
was selected independently, purely at random. Breeding employed simple crossover. 
A 1% mutation rate was randomly applied independently to each locus. 

The horizontal axis in Figure 3.6 corresponds to generation. The AIC values 
for all 20 individuals in each generation are plotted. The best model found included 
predictors 2, 3, 6, 8, 10, 13, 14, 15, 16, 24, 25, and 26, yielding an AIC of -418.95.
This matches the best model found using random starts local search (Table 3.2). 
Darwinian survival of the fittest is clearly illustrated in this figure: The 20 random 
starting individuals rapidly coalesce into 3 effective subspecies, with the best of these 
quickly overwhelming the rest and slowly improving thereafter. The best model was 
first found in generation 60. 
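A genetic algorithm configured like the one in Example 3.5 can be sketched as follows: rank-based fitness as in Equation (3.9), one parent chosen with probability proportional to fitness and the other purely at random, simple crossover, and a 1% per-locus mutation rate. The baseball data are not reproduced here, so the objective `toy_f` (agreement with a hypothetical target chromosome) stands in for the negative AIC; all function names are assumptions, one offspring is kept per crossover, and mutation is implemented as a bit flip.

```python
import random


def rank_fitness(values):
    """Fitness 2 r_i / (P (P + 1)), where r_i is the rank of f (P = best)."""
    P = len(values)
    order = sorted(range(P), key=lambda i: values[i])
    fitness = [0.0] * P
    for rank, i in enumerate(order, start=1):
        fitness[i] = 2.0 * rank / (P * (P + 1))
    return fitness


def next_generation(population, f, mu, rng):
    """Produce generation t + 1 from generation t."""
    fitness = rank_fitness([f(ind) for ind in population])
    C = len(population[0])
    offspring = []
    while len(offspring) < len(population):
        parent1 = rng.choices(population, weights=fitness, k=1)[0]
        parent2 = rng.choice(population)              # purely at random
        split = rng.randint(1, C - 1)                 # simple crossover
        child = parent1[:split] + parent2[split:]
        child = "".join(                              # per-locus bit flips
            ("1" if g == "0" else "0") if rng.random() < mu else g
            for g in child
        )
        offspring.append(child)
    return offspring


def genetic_algorithm(f, C, P=20, generations=100, mu=0.01, seed=None):
    """Maximize f over binary chromosomes of length C."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(C)) for _ in range(P)]
    best = max(population, key=f)
    for _ in range(generations):
        population = next_generation(population, f, mu, rng)
        best = max(population + [best], key=f)
    return best, f(best)


target = "101" * 9                                    # hypothetical optimum, C = 27


def toy_f(chromosome):
    return sum(g == t for g, t in zip(chromosome, target))


best, value = genetic_algorithm(toy_f, C=27, seed=1)
print(best, value)
```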


3.4.1.3 Allele Alphabets and Genotypic Representation The binary alpha- 
bet for alleles was introduced in the pioneering work of Holland [333] and continues 
to be very prevalent in recent research. The theoretical behavior of the algorithm and 
the relative performance of various genetic operators and other algorithmic variations 
are better understood for binary chromosomes than for other choices. 


For many optimization problems, it is possible to construct a binary
encoding of solutions. For example, consider the univariate optimization of
$f(\theta) = 100 - (\theta - 4)^2$ on the range $\theta \in [1, 12.999] = [a_1, a_2]$. Suppose that we
represent a number in $[a_1, a_2]$ as

$$a_1 + \left(\frac{a_2 - a_1}{2^d - 1}\right) \text{decimal}(b), \tag{3.7}$$

where $b$ is a binary number of $d$ digits and the decimal() function converts from base
2 to base 10. If $c$ decimal places of accuracy are required, then $d$ must be chosen to
satisfy

$$(a_2 - a_1)10^c < 2^d - 1. \tag{3.8}$$

In our example, 14 binary digits are required for accuracy to 3 decimal places, and
$b = 01000000000000$ maps to $\theta = 4.000$ using Equation (3.7).
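Equations (3.7) and (3.8) can be checked with a few lines of Python. The function names `bits_required` and `decode` are assumptions made for this sketch; the chromosome is passed as a string of binary digits.

```python
def bits_required(a1, a2, c):
    """Smallest d satisfying (a2 - a1) * 10**c < 2**d - 1, as in (3.8)."""
    d = 1
    while not (a2 - a1) * 10 ** c < 2 ** d - 1:
        d += 1
    return d


def decode(b, a1, a2):
    """Map a binary string b of d digits to [a1, a2] using (3.7)."""
    d = len(b)
    return a1 + (a2 - a1) / (2 ** d - 1) * int(b, 2)


a1, a2 = 1.0, 12.999
print(bits_required(a1, a2, 3))
for b in ("01000000000000", "10000000000000", "00000000000000", "00111111111111"):
    print(b, round(decode(b, a1, a2), 3))
```

Decoding a few chromosomes in this way shows that flipping a single high-order gene can move the phenotype a long way, while chromosomes that differ in nearly every gene can decode to almost the same value.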

In some cases, such as the regression model selection problem, a binary-encoded
chromosome may be very natural. In others, however, the encoding seems forced,
as it does above. For $f(\theta) = 100 - (\theta - 4)^2$, the chromosome $\vartheta = 01000000000000$
($\theta = 4.000$) is optimal. However, chromosomes that are genetically close to this, such
as 10000000000000 ($\theta = 7.000$) and 00000000000000 ($\theta = 1.000$), have phenotypes
that are not close to $\theta = 4.000$. On the other hand, the genotype 00111111111111
has phenotype very close to 4.000 even though the genotype is very different from
01000000000000. Chromosomes that are similar in genotype may have very different
phenotypes. Thus, a small mutation may move to a drastically different region of
solution space, and a crossover may produce offspring whose phenotypes bear little
resemblance to either parent. To resolve such difficulties, a different encoding scheme
or modified genetic operators may be required (see Section 3.4.2.3).

An important alternative to binary representation arises in permutation problems 
of size p, like the traveling salesman problem. In such cases, a natural chromosome is a 
permutation of the integers $1, \ldots, p$, for example, $\vartheta = 752631948$ when $p = 9$. Since
such chromosomes must obey the requirement that each integer appear in exactly one 
locus, some changes to standard genetic operators will be required. Strategies for 
dealing with permutation chromosomes are discussed in Section 3.4.2.3. 


3.4.1.4 Initialization, Termination, and Parameter Values Genetic algo- 
rithms are usually initialized with a first generation of purely random individuals. 

The size of the generation, P, affects the speed, convergence behavior, and 
solution quality of the algorithm. Large values of P are to be preferred, if feasible, 
because they provide a more diverse genetic pool from which to generate offspring, 
thereby diversifying the search and discouraging premature convergence. For binary 
encoding of chromosomes, one suggestion is to choose P to satisfy C < P < 2C, 
where C is the chromosome length [8]. For permutation chromosomes, the range 
2C < P < 20C has been suggested [335]. In most real applications, population sizes 
have ranged between 10 and 200 [566], although a review of empirical studies suggests 
that P can often be as small as 30 [531].

Mutation rates are typically very low, in the neighborhood of 1%. Theoretical
work and empirical studies have supported a rate of $1/C$ [464], and another
investigation suggested that the rate should be nearly proportional to $1/(P\sqrt{C})$ [571].
Nevertheless, a fixed rate independent of P and C is a common choice.

The termination criterion for a genetic algorithm is frequently just a maximum 
number of iterations chosen to limit computing time. One might instead consider 
stopping when the genetic diversity within chromosomes in the current generation is 
sufficiently low [17]. 


3.4.2 Variations 


In this section we survey a number of methodological variations that may offer 
improved performance. These include alterations to the fitness function, selection 
mechanism, genetic operators, and other aspects of the basic algorithm. 


3.4.2.1 Fitness In a canonical genetic algorithm, the fitness of an organism is
often taken to be the objective function value of its phenotype, perhaps scaled by the
mean objective function value in its generation. It is tempting to simply equate the
objective function value $f(\theta)$ to the fitness because the fittest individual then
corresponds to the maximum likelihood solution. However, directly equating an organism’s
fitness to the objective function value for its corresponding phenotype is usually naive
in that other choices yield superior optimization performance. Instead, let $\phi(\vartheta)$ denote
the value of a fitness function that describes the fitness of a chromosome. The fitness
function will depend on the objective function $f$, but will not equal it. This increased
flexibility can be exploited to enhance search effectiveness.

A problem seen in some applications of genetic algorithms is excessively fast 
convergence to a poor local optimum. This can occur when a few of the very best 
individuals dominate the breeding and their offspring saturate subsequent generations. 
In this case, each subsequent generation consists of genetically similar individuals 
that lack the genetic diversity needed to produce offspring that might typify other, 
more profitable regions of solution space. This problem is especially troublesome if it 
occurs directly after initialization, when nearly all individuals have very low fitness. 
A few chromosomes that are more fit than the rest can then pull the algorithm to 
an unfavorable local maximum. This problem is analogous to entrapment near an 
uncompetitive local maximum, which is also a concern for the other search methods 
discussed earlier in this chapter. 

Selective pressure must be balanced carefully, however, because genetic algo- 
rithms can be slow to find a very good optimum. It is therefore important to maintain 
firm selective pressure without allowing a few individuals to cause premature conver- 
gence. To do this, the fitness function can be designed to reduce the impact of large 
variations in f. 

A common approach is to ignore the values of f(θ_i^(t)) and use only their ranks 
[18, 532, 660]. For example, one could set 


    \phi(\vartheta_i^{(t)}) = \frac{2 r_i}{P(P + 1)},        (3.9) 


where r_i is the rank of f(θ_i^(t)) among generation t. This strategy gives the chromosome 
corresponding to the median quality candidate a selection probability of 1/P, and the 
best chromosome has probability 2/(P + 1), roughly double that for the median. 
Rank-based methods are attractive in that they retain a key feature of any successful 
genetic algorithm (selectivity based on relative fitness) while discouraging premature 
convergence and other difficulties caused by the actual form of f, which can be 
somewhat arbitrary [660]. Some less common fitness function formulations involving 
scaling and transforming f are mentioned in [262]. 
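
The rank-based fitness in (3.9) is simple to compute. The following is a minimal sketch, assuming that rank 1 denotes the worst objective value and rank P the best, and that ties are broken by position (an arbitrary choice made only for illustration). The function name and the objective values are hypothetical.

```python
def rank_fitness(objective_values):
    """Return the rank-based fitness 2*r_i / (P*(P+1)) for each candidate.

    Ranks run from 1 (smallest objective value) to P (largest).
    Ties are broken by position, an arbitrary choice for this sketch.
    """
    P = len(objective_values)
    order = sorted(range(P), key=lambda i: objective_values[i])
    ranks = [0] * P
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return [2 * r / (P * (P + 1)) for r in ranks]


# Hypothetical objective values; note the single extreme value 150.0.
f_values = [3.2, -1.0, 150.0, 0.5, 7.7]
phi = rank_fitness(f_values)
```

With P = 5, the median candidate (3.2) receives 1/P = 0.2 and the best candidate receives 2/(P + 1) = 1/3, no matter how extreme the best objective value is.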


3.4.2.2 Selection Mechanisms and Updating Generations Previously, in 
Section 3.4.1.2, we mentioned only simple approaches to selecting parents on the 
basis of fitness. Selecting parents on the basis of fitness ranks (Section 3.4.2.1) is far 
more common than using selection probabilities proportional to fitness. 

Another common approach is tournament selection [204, 263, 264]. In this 
approach, the set of chromosomes in generation t is randomly partitioned into k 
disjoint subsets of equal size (perhaps with a few remaining chromosomes temporarily 
ignored). The best individual in each group is chosen as a parent. Additional random 
partitionings are carried out until sufficient parents have been generated. Parents are 
then paired randomly for breeding. This approach ensures that the best individual 
will breed P/k times, the median individual will breed once on average, and the worst 
individual will not breed at all. The approaches of proportional selection, ranking, and 
tournament selection apply increasing selective pressure, in that order. Higher selec- 
tive pressure is generally associated with superior performance, as long as premature 
entrapment in local optima can be avoided [17]. 
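
Tournament selection as described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: we assume P parents are needed per generation, and the function name and fitness values are hypothetical. When k divides P, each partitioning contributes k parents, so P/k partitionings are carried out and the fittest chromosome wins its subset in every one of them.

```python
import random


def tournament_parents(fitness, k, rng):
    """Select len(fitness) parent indices by repeated random partitioning.

    Each partitioning splits the generation into k disjoint subsets of
    equal size and keeps the fittest member of each subset.
    """
    P = len(fitness)
    if not 1 <= k <= P:
        raise ValueError("k must lie between 1 and the population size")
    size = P // k                      # chromosomes per subset
    parents = []
    while len(parents) < P:
        shuffled = list(range(P))
        rng.shuffle(shuffled)
        # Chromosomes beyond position k * size are ignored in this round.
        for g in range(k):
            group = shuffled[g * size:(g + 1) * size]
            parents.append(max(group, key=lambda i: fitness[i]))
    parents = parents[:P]
    rng.shuffle(parents)               # pair as (parents[0], parents[1]), ...
    return parents


# Hypothetical fitness values for P = 12 chromosomes; index 4 is the best
# and index 5 is the worst.
fitness = [0.3, 2.5, 1.1, 0.9, 4.0, 0.1, 3.3, 1.8, 2.2, 0.7, 1.5, 2.9]
parents = tournament_parents(fitness, k=6, rng=random.Random(1))
```

Here the subsets have size two, so two partitionings are needed. The best chromosome appears exactly twice among the parents and the worst never appears.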

Populations can be partially updated. The generation gap, G, is a proportion of 
the generation to be replaced by generated offspring [146]. Thus, G = 1 corresponds 
to a canonical genetic algorithm with distinct, nonoverlapping generations. At the 
other extreme, G = 1/P corresponds to incremental updating of the population one 
offspring at a time. In this case, a steady-state genetic algorithm produces one offspring 
at a time to replace the least fit (or some random relatively unfit) individual [661]. 
Such a process typically exhibits more variance and higher selective pressure than a 
standard approach. 

When G < 1, performance can sometimes be enhanced with a selection mech- 
anism that departs somewhat from the Darwinian analogy. For example, an elitist 
strategy would place an exact copy of the current fittest individual in the next genera- 
tion, thereby ensuring the survival of the best current solution [146]. When G = 1/P, 
each offspring could replace a chromosome randomly selected from those with below- 
average fitness [5]. 

Deterministic selection strategies have been proposed to eliminate sampling 
variability [19, 464]. We see no compelling need to eliminate the randomness inherent 
in the selection mechanism. 

One important consideration when generating or updating a population is 
whether to allow duplicate individuals in the population. Dealing with dupli- 
cate individuals wastes computing resources, and it potentially distorts the par- 
ent selection criterion by giving duplicated chromosomes more chances to produce 
offspring [138]. 


3.4.2.3 Genetic Operators and Permutation Chromosomes To increase 
genetic mixing, it is possible to choose more than one crossover point. If two 
crossover points are chosen, the gene sequence between them can be swapped be- 
tween parents to create offspring. Such multipoint crossover can improve performance 
[54, 187]. 
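
As a small illustration of crossover with two crossover points, the sketch below swaps the gene sequence lying between two cut positions. The function name, the 0-based cut positions, and the binary parents are assumptions made for this example.

```python
def two_point_crossover(parent1, parent2, a, b):
    """Swap the genes in positions a, ..., b-1 (0-based) between two parents."""
    if len(parent1) != len(parent2):
        raise ValueError("parents must have equal length")
    if not 0 <= a <= b <= len(parent1):
        raise ValueError("cut positions must satisfy 0 <= a <= b <= length")
    child1 = parent1[:a] + parent2[a:b] + parent1[b:]
    child2 = parent2[:a] + parent1[a:b] + parent2[b:]
    return child1, child2


# Hypothetical binary chromosomes with cuts after the 2nd and 5th genes.
c1, c2 = two_point_crossover("1100101", "0011010", 2, 5)
```

Each offspring keeps the outer genes of one parent and receives the middle segment of the other.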

Many other approaches for transferring genes from parents to offspring have 
been suggested. For example, each offspring gene could be filled with an allele ran- 
domly selected from the alleles expressed in that position in the parents. In this case, 
the parental origins of adjacent genes could be independent [4, 622] or correlated 
[602], with strength of correlation controlling the degree to which offspring resemble 
a single parent. 

In some problems, a different allele alphabet may be more reasonable. Allele 
alphabets with many more than two elements have been investigated [13, 138, 524, 
534]. For some problems, genetic algorithms using a floating-point alphabet have 
outperformed algorithms using the binary alphabet [138, 346, 463]. Methods known 
as messy genetic algorithms employ variable-length encoding with genetic operators 
that adapt to changing length [265-267]. Gray coding is another alternative encoding 
that is particularly useful for real-valued objective functions that have a bounded 
number of optima [662]. 

When a nonbinary allele alphabet is adopted, modifications to other aspects of 
the genetic algorithm, particularly to the genetic operators, are often necessary and 
even fruitful. Nowhere is this more evident than when permutation chromosomes 
are used. Recall that Section 3.4.1.3 introduced a special chromosome encoding for 
permutation optimization problems. For such problems (like the traveling salesman 
problem), it is natural to write a chromosome as a permutation of the integers 1, ..., n. 
New genetic operators are needed then to ensure that each generation contains only 
valid permutation chromosomes. 

For example, let p = 9, and consider the crossover operator. From two par- 
ent chromosomes 752631948 and 912386754 and a crossover point between the 
second and third loci, standard crossover would produce offspring 752386754 and 
912631948. Both of these are invalid permutation chromosomes, because both contain 
some duplicate alleles. 

A remedy is order crossover [623]. A random collection of loci is chosen, and 
the order in which the alleles in these loci appear in one parent is imposed on the 
same alleles in the other parent to produce one offspring. The roles of the parents can 
be switched to produce a second offspring. This operator attempts to respect relative 
positions of alleles. For example, consider the parents 752631948 and 912386754, 
and suppose that the fourth, sixth, and seventh loci are randomly chosen. In the first 
parent, the alleles in these loci are 6, 1, and 9. We must rearrange the 6, 1, and 9 alleles 
in the second parent to impose this order. The remaining alleles in the second parent 
are **238*754. Inserting 6, 1, and 9 in this order yields 612389754 as the offspring. 
Reversing the roles of the parents yields a second offspring 352671948. 
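
The order crossover example above can be reproduced with a few lines of code. This is a minimal sketch: the function name and the use of 0-based loci are our own conventions, so the fourth, sixth, and seventh loci become positions 3, 5, and 6.

```python
def order_crossover(donor, receiver, loci):
    """Impose on `receiver` the order in which the alleles at `loci` appear in `donor`.

    `loci` holds 0-based positions in `donor`.
    """
    chosen = [donor[i] for i in sorted(loci)]   # alleles, in donor order
    chosen_set = set(chosen)
    replacement = iter(chosen)
    # Positions of the receiver that hold a chosen allele are refilled,
    # left to right, with the chosen alleles in donor order.
    return [next(replacement) if allele in chosen_set else allele
            for allele in receiver]


p1 = [7, 5, 2, 6, 3, 1, 9, 4, 8]
p2 = [9, 1, 2, 3, 8, 6, 7, 5, 4]
loci = [3, 5, 6]                      # fourth, sixth, and seventh loci
child1 = order_crossover(p1, p2, loci)
child2 = order_crossover(p2, p1, loci)
```

Both offspring are valid permutations, and they match the chromosomes 612389754 and 352671948 obtained in the text.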

Many other crossover operators for permutation chromosomes have been pro- 
posed [135, 136, 138, 268, 464, 492, 587]. Most are focused on the positions of individ- 
ual genes. However, for problems like the traveling salesman problem, such operators 
have the undesirable tendency to destroy links between cities in the parent tours. The 


TABLE 3.3 Edge tables showing the cities linked to or from each allele in either parent 
for each of the first three steps of edge recombination crossover. Beneath each column 
is the offspring chromosome resulting from each step. 


         Step 1                 Step 2                 Step 3 
City   Links            City   Links            City   Links 
1      3, 9, 2          1      3, 2             1      3, 2 
2      5, 6, 1, 3       2      5, 6, 1, 3       2      5, 6, 1, 3 
3      6, 1, 2, 8       3      6, 1, 2, 8       3      6, 1, 2, 8 
4      9, 8, 5          4      8, 5             4      Used 
5      7, 2, 4          5      7, 2, 4          5      7, 2 
6      2, 3, 8, 7       6      2, 3, 8, 7       6      2, 3, 8, 7 
7      8, 5, 6          7      8, 5, 6          7      8, 5, 6 
8      4, 7, 3, 6       8      4, 7, 3, 6       8      7, 3, 6 
9      1, 4             9      Used             9      Used 
       9********               94*******               945****** 


desirability of a candidate solution is a direct function of these links. Breaking links 
is effectively an unintentional source of mutation. Edge-recombination crossover has 
been proposed to produce offspring that contain only links present in at least one 
parent [663, 664]. 

We use the traveling salesman problem to explain edge-recombination 
crossover. The operator proceeds through the following steps. 


1. We first construct an edge table that stores all the links that lead into and out 
of each city in either parent. For our two parents, 752631948 and 912386754, 
the result is shown in the leftmost portion of Table 3.3. Note that the number 
of links into and out of each city in either parent will always be at least two 
and no more than four. Also, recall that a tour returns to its starting city, so, for 
example, the first parent justifies listing 7 as a link from 8. 


2. To begin creating an offspring, we choose between the initial cities of the two 
parents. In our example, the choices are cities 7 and 9. If the parents’ initial cities 
have the same number of links, then the choice is made randomly. Otherwise, 
choose the initial city from the parent whose initial city has fewer links. In our 
example, this yields 9********. 


3. We must now link onward from allele 9. From the leftmost column of the edge 
table, we find that allele 9 has two links: 1 and 4. We want to choose between 
these by selecting the city with the fewest links. To do this, we first update 
the edge table by deleting all references to allele 9, yielding the center portion 
of Table 3.3. Since cities 1 and 4 both have two remaining links, we choose 
randomly between 1 and 4. If 4 is the choice, then the offspring is updated to 
94*******. 


4. There are two possible links onward from city 4: cities 5 and 8. Updating the 
edge table to produce the rightmost portion of Table 3.3, we find that city 5 has 
the fewest remaining links. Therefore, we choose city 5. The partial offspring 
is now 945******. 


Continuing this process might yield the offspring 945786312 by the following steps: 
select 7; select 8; select 6; randomly select 3 from the choices of 2 and 3; randomly 
select 1 from the choices of 1 and 2; select 2. 

Note that in each step a city is chosen among those with the fewest links. If, 
instead, links were chosen uniformly at random, cities would be more likely to be left 
without a continuing edge. Since tours are circuital, the preference for a city with few 
links does not introduce any sort of bias in offspring generation. 
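
The steps of edge-recombination crossover can be sketched in code as follows. This is an illustration under stated assumptions rather than a definitive implementation. In particular, the text does not say what to do if the current city has no remaining links; here we assume a fallback to the unused cities, again preferring those with the fewest links. All function names are hypothetical.

```python
import random


def edge_table(p1, p2):
    """Map each city to the set of cities adjacent to it in either tour."""
    table = {city: set() for city in p1}
    for tour in (p1, p2):
        n = len(tour)
        for i, city in enumerate(tour):
            table[city].add(tour[i - 1])          # tours are circular
            table[city].add(tour[(i + 1) % n])
    return table


def edge_recombination(p1, p2, rng):
    """Build one offspring tour, preferring cities with the fewest links."""
    table = edge_table(p1, p2)
    # Start from the parental initial city with fewer links (random if tied).
    a, b = p1[0], p2[0]
    if len(table[a]) == len(table[b]):
        current = rng.choice([a, b])
    else:
        current = a if len(table[a]) < len(table[b]) else b
    child = [current]
    while len(child) < len(p1):
        # Delete all references to the city just used.
        for links in table.values():
            links.discard(current)
        candidates = table.pop(current)
        if not candidates:
            # Dead end (assumed fallback): consider every unused city.
            candidates = set(table)
        fewest = min(len(table[c]) for c in candidates)
        ties = sorted(c for c in candidates if len(table[c]) == fewest)
        current = rng.choice(ties)
        child.append(current)
    return child


p1 = [7, 5, 2, 6, 3, 1, 9, 4, 8]
p2 = [9, 1, 2, 3, 8, 6, 7, 5, 4]
child = edge_recombination(p1, p2, random.Random(0))
```

For these parents the offspring always begins with city 9, followed by a random choice between cities 1 and 4. If city 4 is chosen, the next four cities are 5, 7, 8, and 6, as in the example above.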

An alternative edge assembly strategy has been found to be extremely effective 
in some problems [477]. 

Mutation of permutation chromosomes is not as difficult as crossover. A simple 
mutation operator is to randomly exchange two genes in the chromosome [531]. 
Alternatively, the elements in a short random segment of the chromosome can be 
randomly permuted [138]. 
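
Both mutation operators preserve the validity of a permutation chromosome, as the following minimal sketch shows. The function names and the segment length are hypothetical choices for illustration.

```python
import random


def swap_mutation(chromosome, rng):
    """Exchange two randomly chosen genes."""
    child = list(chromosome)
    i, j = rng.sample(range(len(child)), 2)   # two distinct positions
    child[i], child[j] = child[j], child[i]
    return child


def segment_shuffle_mutation(chromosome, length, rng):
    """Randomly permute the genes in a random segment of the given length."""
    child = list(chromosome)
    start = rng.randrange(len(child) - length + 1)
    segment = child[start:start + length]
    rng.shuffle(segment)
    child[start:start + length] = segment
    return child


rng = random.Random(3)
tour = [7, 5, 2, 6, 3, 1, 9, 4, 8]
swapped = swap_mutation(tour, rng)
shuffled = segment_shuffle_mutation(tour, 4, rng)
```

Neither operator can create duplicate alleles, so no repair step is required.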


3.4.3 Initialization and Parameter Values 


Although traditionally a genetic algorithm is initiated with a generation of purely 
random individuals, heuristic approaches to constructing individuals with good or 
diverse fitness have been suggested as an improvement on random starts [138, 531]. 

Equal sizes for subsequent generations are not required. Population fitness usu- 
ally improves very rapidly during the early generations of a genetic algorithm. In 
order to discourage premature convergence and promote search diversity, it may be 
desirable to use a somewhat large generation size P for early generations. If P is 
fixed at too large a value, however, the entire algorithm may be too slow for practical 
use. Once the algorithm has made significant progress toward the optimum, important 
improving moves most often come from high-quality individuals; low-quality indi- 
viduals are increasingly marginalized. Therefore, it has been suggested that P may be 
decreased progressively as iterations continue [677]. However, rank-based selection 
mechanisms are more commonly employed as an effective way to slow convergence. 

It can also be useful to allow a variable mutation rate that is inversely 
proportional to the population diversity [531]. This provides a stimulus to promote search 
diversity as generations become less diverse. Several authors suggest other methods 
for allowing the probabilities of mutation and crossover and other parameters of the 
genetic algorithm to vary adaptively over time in manners that may encourage search 
diversity [54, 137, 138, 464]. 


3.4.4 Convergence 


The convergence properties of genetic algorithms are beyond the scope of this chapter, 
but several important ideas are worth mentioning. 

Much of the early analysis about why genetic algorithms work was based on the 
notion of schemata [262, 333]. Such work is based on a canonical genetic algorithm 
with binary chromosome encoding, selection of each parent with probability 
proportional to fitness, simple crossover applied every time parents are paired, and mutation 
randomly applied to each gene independently with probability μ. For this setting, the 
schema theorem provides a lower bound on the expected number of instances of a 
schema in generation t + 1, given that it was present in generation t. 

The schema theorem shows that a short, low-order schema (i.e., one specify- 
ing only a few nearby alleles) will enjoy increased expected representation in the 
next generation if the average fitness of chromosomes containing that schema in the 
generation at time t exceeds the average fitness of all chromosomes in the genera- 
tion. A longer and/or more complex schema will require higher relative fitness to 
have the same expectation. Proponents of schema theory argue that convergence to 
globally competitive candidate solutions can be explained by how genetic algorithms 
simultaneously juxtapose many short low-order schemata of potentially high fitness, 
thereby promoting propagation of advantageous schemata. 
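
For orientation, one commonly quoted form of the schema theorem bound for the canonical setting is sketched below. The notation is introduced here only for illustration and is not defined elsewhere in this section: m(S, t) is the number of instances of schema S in generation t, \hat f(S, t) is the average fitness of the chromosomes containing S, \bar f(t) is the average fitness of the whole generation, \delta(S) is the defining length of S (the distance between its outermost specified loci), and o(S) is its order (the number of specified loci). Exact statements vary across references.

```latex
E\{ m(S, t+1) \} \;\ge\; m(S, t)\,
  \frac{\hat f(S, t)}{\bar f(t)}
  \left[ 1 - \frac{\delta(S)}{C - 1} \right]
  (1 - \mu)^{o(S)}
```

The bracketed factor bounds the chance that simple crossover leaves the schema intact, and the final factor is the chance that mutation does. Both are close to 1 when the schema is short and of low order, so above-average fitness then suffices for increased expected representation.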

More recently, the schema theorem and convergence arguments based upon it 
have become more controversial. Traditional emphasis on the number of instances of 
a schema that propagate to the next generation and on the average fitness of chromo- 
somes containing that schema is somewhat misguided. What matters far more is which 
particular chromosomes containing that schema are propagated. Further, the schema 
theorem overemphasizes the importance of schemata: in fact it applies equally well 
to any arbitrary subsets of Θ. Finally, the notion that genetic algorithms succeed 
because they implicitly simultaneously allocate search effort to many schemata-defined 
regions of Θ has been substantially discredited [647]. An authoritative exposition of 
the mathematical theory of genetic algorithms is given by Vose [646]. Other helpful 
treatments include [200, 533]. 

Genetic algorithms are not the only optimization strategy that can be motivated 
by analogy to a complex biological system. For example, particle swarm optimization 
also creates and updates a population of candidate solutions [372, 373, 594]. The lo- 
cations of these solutions within the search space evolve through simple rules that can 
be viewed as reflecting cooperation and competition between individuals analogous 
to the movement of birds in a flock. Over a sequence of iterations, each individual 
adjusts its location (i.e., candidate solution) based on its own flying experience and 
those of its companions. 


3.5 TABU ALGORITHMS 


A tabu algorithm is a local search algorithm with a set of additional rules that guide 
the selection of moves in ways that are believed to promote the discovery of a global 
maximum. The approach employs variable neighborhoods: The rules for identifying 
acceptable steps change at each iteration. Detailed studies of tabu methods include 
[254, 255, 257-259]. 

In a standard ascent algorithm, entrapment in a globally uncompetitive local 
maximum is likely, because no downhill moves are allowed. Tabu search allows 
downhill moves when no uphill move can be found in the current neighborhood 
(and possibly in other situations too), thereby potentially escaping entrapment. An 
early form of tabu search, called steepest ascent/mildest descent, moved to the least 
unfavorable neighbor when there was no uphill move [306]. 


TABLE 3.4 Examples of attributes. The left column gives examples in a generic context. The right 
column gives corresponding attributes in the specific context of 2-change neighborhoods in a 
regression model selection problem. 


Attribute                                        Model Selection Example 

A change in the value of θ_i^(t). The attribute  A_1: Whether the ith predictor is added (or 
may be the value from which the move             deleted) from the model. 
began, or the value at which it arrived. 

A swap in the values of θ_i^(t) and θ_j^(t)      A_2: Whether the absent variable is exchanged for 
when θ_i^(t) ≠ θ_j^(t).                          the variable present in the model. 

A change in the value of f resulting from the    A_3: The reduction in AIC achieved by the move. 
step, f(θ^(t+1)) − f(θ^(t)). 

The value g(θ^(t+1)) of some other               A_4: The number of predictors in the new model. 
strategically chosen function g. 

A change in the value of g resulting from the    A_5: A change to a different variable selection 
step, g(θ^(t+1)) − g(θ^(t)).                     criterion such as Mallows’s C_p [435] or the 
                                                 adjusted R² [483]. 

If a downhill step is chosen, care must be taken to ensure that the next step (or a 
future one) does not simply reverse the downhill move. Such cycling would eliminate 
the potential long-term benefit of the downhill move. To prevent such cycling, certain 
moves are temporarily forbidden, or made tabu, based on the recent history of the 
algorithm. 

There are four general types of rules added to local search by tabu search 
methods. The first is to make certain potential moves temporarily tabu. The others 
involve aspiration to better solutions, intensification of search in promising areas of 
solution space, and diversification of search candidates to promote broader exploration 
of the solution space. These terms will be defined after we discuss tabus. 


3.5.1 Basic Definitions 


Tabu search is an iterative algorithm initiated at time t = 0 with a candidate solution 
θ^(0). At the tth iteration, a new candidate solution is selected from a neighborhood 
of θ^(t). This candidate becomes θ^(t+1). Let H^(t) denote the history of the algorithm 
through time t. It suffices for H^(t) to be a selective history, remembering only certain 
matters necessary for the future operation of the algorithm. 

Unlike simple local search, a tabu algorithm generates a neighborhood of 
the current candidate solution that depends on the search history; denote this by 
N(θ^(t), H^(t)). Furthermore, the identification of the preferred θ^(t+1) in N(θ^(t), H^(t)) 
may depend not only on f but also on the search history. Thus, we may assess 
neighbors using an augmented objective function, f_{H^(t)}. 

A single step from θ^(t) to θ^(t+1) can be characterized by many attributes. 
Attributes will be used to describe moves or types of moves that will be forbid- 
den, encouraged, or discouraged in future iterations of the algorithm. Examples of 
attributes are given in the left column of Table 3.4. Such attributes are not unique to 
tabu search; indeed, they can be used to characterize moves from any local search. 
However, tabu search explicitly adapts the current neighborhood according to the 
attributes of recent moves. 

The attributes in Table 3.4 can be illustrated by considering a regression model 
selection problem. Suppose θ_i^(t) = 1 if the ith predictor is included in the model 
at time t, and 0 otherwise. Suppose that 2-change neighborhoods consist of all 
models obtained from the current model by separately adding or deleting each of 
two variables. The right column of Table 3.4 gives one example of each generic 
attribute listed, in the context of these 2-change neighborhoods in the regression 
model selection problem from Example 3.2. These examples are labeled A_1 through 
A_5. Many other effective attributes can be identified from the context of specific 
optimization problems. 

Denote the ath attribute as A_a. Note that the complement (i.e., negation) of an 
attribute is also an attribute, so if A_a corresponds to swapping the values of θ_i^(t) and 
θ_j^(t), then Ā_a corresponds to not making that swap. 

As the algorithm progresses, the attributes of the tth move will vary with t, and 
the quality of the candidate solution will also vary. Future moves can be guided by 
the history of past moves, their objective function values, and their attributes. The 
recency of an attribute is the number of steps that have passed since a move most 
recently had that attribute. Let R(A_a, H^(t)) = 0 if the ath attribute is expressed in the 
move yielding θ^(t), let R(A_a, H^(t)) = 1 if it is most recently expressed in the move 
yielding θ^(t−1), and so forth. 


3.5.2 The Tabu List 


When considering a move from θ^(t), we compute the increase in the objective function 
achieved for each neighbor of θ^(t). Ordinarily, the neighbor that provides the greatest 
increase would be adopted as θ^(t+1). This corresponds to the steepest ascent. 

Suppose, however, that no neighbor of θ^(t) yields an increased objective 
function. Then θ^(t+1) is ordinarily chosen to be the neighbor that provides the smallest 
decrease. This is the mildest descent. 

If only these two rules were used for search, the algorithm would quickly 
become trapped and converge to a local maximum. After one move of mildest descent, 
the next move would return to the hilltop just departed. Cycling would ensue. 

To avoid such cycling, a tabu list of temporarily forbidden moves is incorporated 
in the algorithm. Each time a move with attribute A_a is taken, Ā_a is put on 
a tabu list for τ iterations. When R(A_a, H^(t)) first equals τ, the tabu expires and 
Ā_a is removed from the tabu list. Thus, moves with attributes on the tabu list are 
effectively excluded from the current neighborhood. The modified neighborhood is 
denoted 


    \mathcal{N}(\theta^{(t)}, H^{(t)}) = \{ \theta : \theta \in \mathcal{N}(\theta^{(t)}) \text{ and no attribute of } \theta \text{ is currently tabu} \}.        (3.10) 


This prevents undoing the change for τ iterations, thereby discouraging cycling. By 
the time that the tabu has expired, enough other aspects of the candidate solution 
should have changed that reversing the move may no longer be counterproductive. 
Note that the tabu list is a list of attributes, not moves, so a single tabu attribute may 
forbid entire classes of moves. 
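
To make the interplay of recency, tabu tenure, and the tabu list concrete, the following is a minimal sketch of steepest ascent/mildest descent with a recency-based tabu list. The setting is an assumption chosen for illustration: candidate solutions are binary vectors, a move flips one component, and the attribute of a move is the index flipped, so the tabu forbids flipping that index back for the tenure `tau`. The toy objective `f` and all names are hypothetical.

```python
def tabu_search(f, theta0, tau, n_iter):
    """Steepest ascent/mildest descent over 1-flip moves with a tabu list."""
    theta = list(theta0)
    best, best_value = list(theta), f(theta)
    yielded_at = {}     # flipped index -> t of the iterate that flip produced
    for t in range(n_iter):
        scored = []
        for i in range(len(theta)):
            # Recency: moves elapsed since index i was last flipped.
            if i in yielded_at and t - yielded_at[i] < tau:
                continue                      # reversing this flip is tabu
            neighbor = theta[:i] + [1 - theta[i]] + theta[i + 1:]
            scored.append((f(neighbor), i, neighbor))
        if not scored:
            break                             # every move is currently tabu
        # The largest value is the steepest ascent if an uphill move
        # exists and the mildest descent otherwise.
        value, i, theta = max(scored, key=lambda s: s[0])
        yielded_at[i] = t + 1
        if value > best_value:
            best, best_value = list(theta), value
    return best, best_value


def f(theta):
    """Toy objective with a local maximum at the all-zero vector."""
    return {0: 2.0, 1: 1.0, 2: 1.5, 3: 3.0, 4: 5.0}[sum(theta)]


with_tabu = tabu_search(f, [0, 0, 0, 0], tau=2, n_iter=4)
without_tabu = tabu_search(f, [0, 0, 0, 0], tau=0, n_iter=4)
```

With `tau=0` the search steps down from the local maximum and immediately steps back, cycling without improvement. With `tau=2` the reversal is forbidden long enough for the search to reach the global maximum at the all-one vector.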

The tabu tenure, τ, is the number of iterations over which an attribute is tabu. 
This can be a fixed number or it may vary, systematically or randomly, perhaps based 
on features of the attribute. For a given problem, a well-chosen tabu tenure will 
be long enough to prevent cycling and short enough to prevent the deterioration of 
candidate solution quality that occurs when too many moves are forbidden. Fixed 
tabu tenures between 7 and 20, or between 0.5√p and 2√p, where p is the size of 
the problem, have been suggested for various problem types [257]. Tabu tenures that 
vary dynamically seem more effective in many problems [259]. Also, it will often be 
important to use different tenures for different attributes. If an attribute contributes 
tabu restrictions for a wide variety of moves, the corresponding tabu tenure should 
be short to ensure that future choices are not limited. 


Example 3.6 (Genetic Mapping, Continued) We illustrate some uses of tabus, 
using the gene mapping problem introduced in Example 3.1. 

First, consider monitoring the swap attribute. Suppose that A_a is the swap 
attribute corresponding to exchanging two particular loci along the chromosome. 
When a move A_a is taken, it is counterproductive to immediately undo the swap, so 
Ā_a is placed on the tabu list. Search progresses only among moves that do not reverse 
recent swaps. Such a tabu promotes search diversity by avoiding quick returns to 
recently searched areas. 

Second, consider the attribute identifying the locus label θ_j for which d(θ_j, θ_{j+1}) 
is smallest in the new move. In other words, this attribute identifies the two loci in 
the new chromosome that are nearest each other. If the complement of this attribute 
is put on the tabu list, any move to a chromosome for which other loci are closer will 
be forbidden for τ iterations. Such a tabu promotes search intensity among 
genetic maps for which loci θ_j and θ_{j+1} are closest. 

Sometimes, it may be reasonable to place the attribute itself, rather than its 
complement, on the tabu list. For example, let h(θ) compute the mean d(θ_j, θ_{j+1}) 
between adjacent loci in a chromosome ordered by θ. Let A_a be the attribute 
indicating excessive change of the mean conditional MLE map distance, so A_a equals 1 if 
|h(θ^(t+1)) − h(θ^(t))| > c and 0 otherwise, for some fixed threshold c. If a move with 
mean change greater than c is taken, we may place A_a itself on the tabu list for τ 
iterations. This prevents any other drastic mean changes for a period of time, allowing 
better exploration of the newly entered region of solution space before moving 
far away. 


3.5.3 Aspiration Criteria 


Sometimes, choosing not to move to a nearby candidate solution because the move is 
currently tabu can be a poor decision. In these cases, we need a mechanism to override 
the tabu list. Such a mechanism is called an aspiration criterion. 

A simple and popular aspiration criterion is to permit a tabu move if it provides 
a higher value of the objective function than has been found in any iteration so far. 
Clearly it makes no sense to overlook the best solution found so far, even if it is 
currently tabu. One can easily envision scenarios where this aspiration criterion is 
useful. For example, suppose that a swap of two components of θ is on the tabu 
list and the candidate solutions at each iteration recently have drifted away from the 
region of solution space being explored when the tabu began. The search will now 
be in a new region of solution space where it is quite possible that reversing the tabu 
swap would lead to a drastic increase in the objective function. 
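
The best-so-far aspiration criterion amounts to a one-line change in how admissible moves are filtered. The sketch below assumes each neighbor is summarized by a hypothetical triple of objective value, tabu status, and label; the data are invented for illustration.

```python
def select_move(neighbors, best_value_so_far):
    """Pick the best admissible neighbor.

    `neighbors` is a list of (value, is_tabu, label) triples. A tabu
    move is admissible only if its value exceeds the best value found
    in any iteration so far (the aspiration criterion).
    """
    admissible = [n for n in neighbors
                  if not n[1] or n[0] > best_value_so_far]
    if not admissible:
        return None
    return max(admissible, key=lambda n: n[0])


# Hypothetical neighborhood: moves "a" and "c" are currently tabu.
neighbors = [(4.0, True, "a"), (3.0, False, "b"), (9.5, True, "c")]
overridden = select_move(neighbors, best_value_so_far=9.0)
respected = select_move(neighbors, best_value_so_far=10.0)
```

When the best value so far is 9.0, the tabu on move "c" is overridden because it would set a new record. When the record is 10.0, the tabu is respected and the nontabu move "b" is chosen.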

Another interesting option is aspiration by influence. A move or attribute is 
influential if it is associated with a large change in the value of the objective function. 
There are many ways to make this idea concrete [257]. To avoid unnecessary detail 
about numerous specific possibilities, let us simply denote the influence of the ath 
attribute as I(A_a, H^(t)) for a move yielding θ^(t). In many combinatorial problems, 
there are a lot of neighboring moves that cause only small incremental changes to the 
value of the objective function, while there are a few moves that cause major shifts. 
Knowing the attributes of such moves can help guide search. Aspiration by influence 
overrides the tabu on reversing a low-influence move if a high-influence move is 
made prior to the reversal. The rationale for this is that the recent high-influence step 
may have moved the search to a new region of the solution space where further local 
exploration is useful. The reversal of the low-influence move will probably not induce 
cycling, since the intervening high-influence move likely shifted scrutiny to a portion 
of solution space more distant than what could be reached by the low-influence 
reversal. 

Aspiration criteria can also be used to encourage moves that are not tabu. For 
example, when low-influence moves appear to provide only negligible improvement 
in the objective function, they can be downweighted and high-influence moves can 
be given preference. There are several ways to do this; one approach is to incorporate 
in f_{H^(t)} either a penalty or an incentive term that depends on the relative influence of 
candidate moves. 


3.5.4 Diversification 


An important component of any search is to ensure that the search is broad enough. 
Rules based on how often attributes are observed during search can be used to increase 
the diversity of candidate solutions examined during tabu search. 

The frequency of an attribute records the number of moves that manifested that 
attribute since the search began. Let C(A_a, H^(t)) represent the count of occurrences 
of the ath attribute thus far. Then F(A_a, H^(t)) represents a frequency function that can 
be used to penalize moves that are repeated too frequently. The most direct definition 
is F(A_a, H^(t)) = C(A_a, H^(t))/t, but the denominator may be replaced by the sum, 
the maximum, or the average of the counts of occurrences of various attributes. 
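
The most direct frequency definition, the count of occurrences divided by t, can be computed from a simple record of past moves. In this minimal sketch the history is assumed to be a list holding, for each move so far, the set of attributes that move expressed; the attribute labels are hypothetical.

```python
def attribute_frequency(move_attributes, attribute):
    """Return C/t: the share of moves so far that expressed `attribute`."""
    t = len(move_attributes)
    if t == 0:
        return 0.0
    count = sum(1 for attrs in move_attributes if attribute in attrs)
    return count / t


# Hypothetical history of four moves in a model selection search.
history = [{"add_x1"}, {"drop_x2", "add_x3"}, {"add_x1"}, {"add_x4"}]
freq_x1 = attribute_frequency(history, "add_x1")
freq_x3 = attribute_frequency(history, "add_x3")
```

Restricting `history` to its last w entries gives the frequency over the most recent w moves instead of the entire search.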

Suppose the frequency of each attribute is recorded, either over the entire history 
or over the most recent w moves. Note that this frequency may be one of two types, 
depending on the attribute considered. If the attribute corresponds to some feature 
of θ^(t), then the frequency measures how often that feature is seen in candidate 
solutions considered during search. Such frequencies are termed residence frequencies. 
If, alternatively, the attribute corresponds to some change induced by moving from 
one candidate solution to another, then the frequency is a transition frequency. For 
example, in the regression model selection problem introduced in Example 3.2, the 
attribute noting the inclusion of the predictor x_i in the model would have a residence 
frequency. The attribute that signaled when a move reduced the AIC would have a 
transition frequency. 

If attribute A_a has a high residence frequency and the history of the most recent 
w moves covers nearly optimal regions of solution space, this may suggest that A_a 
is associated with high-quality solutions. On the other hand, if the recent history 
reflects the search getting stuck in a low-quality region of solution space, then a high 
residence frequency may suggest that the attribute is associated with bad solutions. 
Usually, w > τ is an intermediate or long-term memory parameter that allows the 
accumulation of additional historical information to diversify future search. 

If attribute A_a has a high transition frequency, this attribute may be what has been 
termed a crack filler. Such an attribute may be frequently visited during the search 
in order to fine-tune good solutions but rarely offers fundamental improvement or 
change [257]. In this case, the attribute has low influence. 

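A direct approach employing frequency to increase search diversification is to 
incorporate a penalty or incentive function in f_{H^(t)}. The choice 


    f_{H^{(t)}}(\theta^{(t+1)}) = 
    \begin{cases} 
      f(\theta^{(t+1)}) & \text{if } f(\theta^{(t+1)}) \ge f(\theta^{(t)}), \\ 
      f(\theta^{(t+1)}) - c F(A_a, H^{(t)}) & \text{if } f(\theta^{(t+1)}) < f(\theta^{(t)}) 
    \end{cases} 
    (3.11) 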


with c > 0 has been suggested [566]. If all nontabu moves are downhill, then this 
approach discourages moves that have the high-frequency attribute A,. An analogous 
strategy can be crafted to diversify the selection of uphill moves. 
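
The penalty in (3.11) can be sketched in a few lines, assuming a maximization problem and assuming the frequency of the attribute has already been computed. The function and argument names are hypothetical.

```python
def penalized_objective(f_new, f_current, frequency, c=1.0):
    """Augmented objective in the spirit of (3.11).

    f_new     : f evaluated at the candidate solution
    f_current : f evaluated at the current solution
    frequency : recorded frequency of the attribute attached to the move
    c         : positive penalty weight
    """
    if f_new >= f_current:
        # Uphill (or level) moves are evaluated by f alone.
        return f_new
    # Downhill moves are penalized in proportion to the attribute frequency.
    return f_new - c * frequency
```

With this form, two downhill moves that have equal f are ranked so that the one with the less frequently seen attribute is preferred.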

Instead of incorporating a penalty or incentive in the objective function, it is 
possible to employ a notion of graduated tabu status, where an attribute may be only 
partially tabu. One way to create a tabu status that varies by degrees is to invoke 
probabilistic tabu decisions: An attribute can be assigned a probability of being tabu, 
where the probability is adjusted according to various factors, including the tabu 
tenure [257]. 


3.5.5 Intensification 


In some searches it may be useful to intensify the search effort in particular areas of 
solution space. Frequencies can also be used to guide such intensification. Suppose 
that the frequencies of attributes are tabulated over the most recent v moves, and a 
corresponding record of objective function values is kept. By examining these data, 
key attributes shared by good candidate solutions can be identified. Then moves that 
retain such features can be rewarded and moves that remove such features can be 
penalized through $f_{H^{(t)}}$. The time span $v > \tau$ parameterizes the length of a long-term
memory to enable search intensification in promising areas of solution space. 




3.5.6 Comprehensive Tabu Algorithm 


Below we summarize a fairly general tabu algorithm that incorporates many of the 
features described above. After initialization and identification of a list of problem- 
specific attributes, the algorithm proceeds as follows: 


1. Determine an augmented objective function $f_{H^{(t)}}$ that depends on $f$ and perhaps
on
a. frequency-based penalties or incentives to promote diversification, and/or
b. frequency-based penalties or incentives to promote intensification.

2. Identify neighbors of $\theta^{(t)}$, namely the members of $\mathcal{N}(\theta^{(t)})$.

3. Rank the neighbors in decreasing order of improvement, as evaluated
by $f_{H^{(t)}}$.

4. Select the highest ranking neighbor.

5. Is this neighbor currently on the tabu list? If not, go to step 8.

6. Does this neighbor pass an aspiration criterion? If so, go to step 8.

7. If all neighbors of $\theta^{(t)}$ have been considered and none have been adopted as
$\theta^{(t+1)}$, then stop. Otherwise, select the next most high-ranking neighbor and go
to step 5.

8. Adopt this solution as $\theta^{(t+1)}$.

9. Update the tabu list by creating new tabus based on the current move and by
deleting tabus whose tenures have expired.

10. Has a stopping criterion been met? If so, stop. Otherwise, increment $t$ and go
to step 1.


It is sensible to stop when a maximum number of iterations has been reached, and 
then to take the best candidate solution yet found as the final result. Search effort can 
be split among a collection of random starts rather than devoting all resources to one 
run from a single start. By casting tabu search in a Markov chain framework, it is 
possible to obtain results on the limiting convergence of the approach [191]. 
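
The numbered steps can be sketched as follows for a toy maximization problem over binary vectors. This is a simplified illustration under stated assumptions, not the implementation used for the examples in this chapter: neighborhoods are single-bit flips, the only attributes are the bit positions, the augmented objective is taken to be f itself (no frequency-based terms), and the aspiration criterion admits a tabu move only if it beats the best value seen so far. All names are hypothetical.

```python
def tabu_search(f, theta0, n_iter=75, tenure=5):
    """Maximize f over 0/1 tuples with 1-flip neighborhoods and a recency-based tabu list."""
    theta = tuple(theta0)
    best, best_val = theta, f(theta)
    tabu_until = {}  # bit index -> last iteration at which flipping that bit is tabu
    for t in range(n_iter):
        # Steps 2-3: enumerate the neighbors and rank them, best first.
        moves = []
        for j in range(len(theta)):
            candidate = theta[:j] + (1 - theta[j],) + theta[j + 1:]
            moves.append((f(candidate), j, candidate))
        moves.sort(key=lambda move: move[0], reverse=True)
        # Steps 4-7: adopt the best neighbor that is nontabu or passes aspiration.
        chosen = None
        for val, j, candidate in moves:
            is_tabu = tabu_until.get(j, -1) >= t
            if not is_tabu or val > best_val:
                chosen = (val, j, candidate)
                break
        if chosen is None:
            break  # every neighbor is tabu and none passes aspiration
        # Step 8: adopt the chosen neighbor.
        val, j, theta = chosen
        # Step 9: reversing this flip is tabu for the next `tenure` iterations;
        # expired tabus are ignored automatically by the comparison with t.
        tabu_until[j] = t + tenure
        if val > best_val:
            best, best_val = theta, val
    return best, best_val


def toy_objective(theta):
    # Maximized (value 0) by any vector containing exactly three ones.
    return -(sum(theta) - 3) ** 2


best, best_val = tabu_search(toy_objective, (0, 0, 0, 0, 0, 0))
```

Because the search keeps moving after reaching an optimum, the best solution found is recorded separately from the current solution, in line with the advice above to report the best candidate seen.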


Example 3.7 (Baseball Salaries, Continued) A simple tabu search was applied to 
the variable selection problem for regression modeling of the baseball data introduced 
in Example 3.3. Only attributes signaling the presence or absence of each predictor 
were monitored. Moves that would reverse the inclusion or removal of a predictor 
were made tabu for $\tau = 5$ moves, and the algorithm was run for 75 moves from a
random start. The aspiration criterion permitted an otherwise tabu move if it yielded 
an objective function value above the best previously seen. 

Figure 3.7 shows the values of the AIC for the sequence of candidate solutions 
generated by this tabu search. The AIC was quickly improved, and an optimum value 
of −418.95, derived from the model using predictors 2, 3, 6, 8, 10, 13, 14, 15, 16, 24,
25, and 26, was found on two occasions: iterations 29 and 43. This solution matches
the best model found using random starts local search (Table 3.2).

FIGURE 3.7 Results of tabu search for Example 3.7. Only AIC values between −360 and
−420 are shown.


PROBLEMS 


The baseball data introduced in Section 3.3 are available from the website for this book. 
Problems 3.1-3.4 explore the implications of various algorithm configurations. Treat these 
problems in the spirit of experiments, trying to identify settings where interesting differences 
can be observed. Increase the run lengths from those used above to suit the speed of your 
computer, and limit the total number of objective function evaluations in every run (effectively 
the search effort) to a fixed number so that different algorithms and configurations can be 
compared fairly. Summarize your comparisons and conclusions. Supplement your comments 
with graphs to illustrate key points. 


3.1. Implement a random starts local search algorithm for minimizing the AIC for the 
baseball salary regression problem. Model your algorithm after Example 3.3. 


a. Change the move strategy from steepest descent to immediate adoption of the first 
randomly selected downhill neighbor. 


b. Change the algorithm to employ 2-neighborhoods, and compare the results with 
those of previous runs. 


3.2. Implement a tabu algorithm for minimizing the AIC for the baseball salary regression 
problem. Model your algorithm after Example 3.7. 


a. Compare the effect of using different tabu tenures. 


b. Monitor changes in AIC from one move to the next. Define a new attribute that 
signals when the AIC change exceeds some value. Allow this attribute to be included 
on the tabu list, to promote search diversity. 


c. Implement aspiration by influence, overriding the tabu of reversing a low-influence 
move if a high-influence move is made prior to the reversal. Measure influence with 
changes in $R^2$.


3.3. Implement simulated annealing for minimizing the AIC for the baseball salary regression
problem. Model your algorithm on Example 3.4.


a. Compare the effects of different cooling schedules (different temperatures and different
durations at each temperature).


b. Compare the effect of a proposal distribution that is discrete uniform over
2-neighborhoods versus one that is discrete uniform over 3-neighborhoods.


3.4. Implement a genetic algorithm for minimizing the AIC for the baseball salary regression 
problem. Model your algorithm on Example 3.5. 


a. Compare the effects of using different mutation rates.


b. Compare the effects of using different generation sizes.


c. Instead of the selection mechanism used in Example 3.5, try the following three
mechanisms:
i. Independent selection of one parent with probability proportional to fitness and
the other completely at random
ii. Independent selection of each parent with probability proportional to fitness
iii. Tournament selection with P/5 strata, and/or another number of strata that you
prefer

To implement some of these approaches, you may need to scale the fitness function.
For example, consider the scaled fitness functions $\phi$ given by

$$\phi(\vartheta_i^{(t)}) = a f(\theta_i^{(t)}) + b, \tag{3.12}$$

$$\phi(\vartheta_i^{(t)}) = f(\theta_i^{(t)}) - (\bar{f} - zs), \tag{3.13}$$

or

$$\phi(\vartheta_i^{(t)}) = f(\theta_i^{(t)})^v, \tag{3.14}$$

where $a$ and $b$ are chosen so that the mean fitness equals the mean objective function
value and the maximum fitness is a user-chosen $c$ times greater than the mean fitness,
$\bar{f}$ is the mean and $s$ is the standard deviation of the unscaled objective function values
in the current generation, $z$ is a number generally chosen between 1 and 3, and $v$
is a number slightly larger than 1. Some scalings can sometimes produce negative
values for $\phi(\vartheta_i^{(t)})$. In such situations, we may apply the transformation

$$
\phi_{\text{new}}(\vartheta_i^{(t)}) =
\begin{cases}
\phi(\vartheta_i^{(t)}) + d^{(t)} & \text{if } \phi(\vartheta_i^{(t)}) + d^{(t)} > 0, \\
0 & \text{otherwise},
\end{cases}
\tag{3.15}
$$

where $d^{(t)}$ is the absolute value of the fitness of the worst chromosome in generation
$t$, in the last $k$ generations for some $k$, or in all preceding generations. Each of these
scaling approaches has the capacity to dampen the variation in $f$, thereby retaining
within-generation diversity and increasing the potential to find the global optimum.

Compare and comment on the results for your chosen methods.


d. Apply a steady-state genetic algorithm, with the generation gap G = 1/P. Compare
with the canonical option of distinct, nonoverlapping generations.


e. Implement the following crossover approach, termed uniform crossover [622]: Each
locus in the offspring is filled with an allele independently selected at random from
the alleles expressed in that position in the parents.


FIGURE 3.8 Chromosomes for Problem 3.5. Simulated data on 12 loci are available for
100 individuals. For each locus, the source chromosome from the heterozygous parent is
encoded in black or white, analogously to Figure 3.1 in Example 3.1. The left panel shows
the data arranged according to the true locus ordering, whereas the right panel shows the
data arranged by locus label as they would be recorded during data collection.


3.5. Consider the genetic mapping example introduced in Example 3.1. Figure 3.8 shows 
some data for 100 simulated data sequences for a chromosome of length 12. The left 
panel of this figure shows the data under the true genetic map ordering, and the right 
panel shows the actual data, with the ordering unknown to the analyst. The data are 
available from the website for this book. 


a. Apply a random starts local search approach to estimate the genetic map (i.e., the
ordering and the genetic distances). Let neighborhoods consist of 20 orderings
that differ from the current ordering by randomly swapping the placement of two
alleles. Move to the best candidate in the neighborhood, thereby taking a random
descent step. Begin with a small number of starts of limited length, to gauge the
computational difficulty of the problem; then report the best results you obtained
within reasonable limits on the computational burden. Comment on your results,
the performance of the algorithm, and ideas for improved search. [Hint: Note that
the orderings $(\theta_{j_1}, \theta_{j_2}, \ldots, \theta_{j_{12}})$ and $(\theta_{j_{12}}, \theta_{j_{11}}, \ldots, \theta_{j_1})$ represent identical
chromosomes read from either end.]


b. Apply an algorithm for random starts local search via steepest descent to estimate
the genetic map. Comment on your results and the performance of the algorithm.
This problem is computationally demanding and may require a fast computer.


3.6. Consider the genetic mapping data described in Problem 3.5. 


a. Apply a genetic algorithm to estimate the genetic map (i.e., the ordering and the
genetic distances). Use the order crossover method. Begin with a small run to
gauge the computational difficulty of the problem, then report your results for a run
using reasonable limits on the computational burden. Comment on your results, the
performance of the algorithm, and ideas for improved search.


b. Compare the speed of fitness improvements achieved with the order crossover and
the edge-recombination crossover strategies.


c. Attempt any other heuristic search method for these data. Describe your
implementation, its speed, and the results.


3.7. The website for this book also includes a second synthetic dataset for a genetic mapping
problem. For these data, there are 30 chromosomes. Attempt one or more heuristic
search methods for these data. Describe your implementation, the results, and the nature
of any problems you encounter. The true ordering used to simulate the data is also given
for this dataset. Although the true ordering may not be the MLE, how close is your best
ordering to the true ordering? How much larger is this problem than the one examined
in the previous problem?


3.8. Thirteen chemical measurements were carried out on each of 178 wines from three
regions of Italy [53]. These data are available from the website for this book. Using
one or more heuristic search methods from this chapter, partition the wines into three
groups for which the total of the within-group sum of squares is minimal. Comment on
your work and the results. This is a search problem of size $3^p$ where $p = 178$. If you
have access to standard cluster analysis routines, check your results using a standard
method like that of Hartigan and Wong [317].


CHAPTER 4


EM OPTIMIZATION METHODS 


The expectation–maximization (EM) algorithm is an iterative optimization strategy
motivated by a notion of missingness and by consideration of the conditional distribu- 
tion of what is missing given what is observed. The strategy’s statistical foundations 
and effectiveness in a variety of statistical problems were shown in a seminal paper 
by Dempster, Laird, and Rubin [150]. Other references on EM and related methods 
include [409, 413, 449, 456, 625]. The popularity of the EM algorithm stems from 
how simple it can be to implement and how reliably it can find the global optimum 
through stable, uphill steps. 

In a frequentist setting, we may conceive of observed data generated from 
random variables X along with missing or unobserved data from random variables Z. 
We envision complete data generated from Y = (X, Z). Given observed data x, we 
wish to maximize a likelihood $L(\theta|x)$. Often it will be difficult to work with this
likelihood and easier to work with the densities of $Y|\theta$ and $Z|(x, \theta)$. The EM algorithm
sidesteps direct consideration of $L(\theta|x)$ by working with these easier densities.

In a Bayesian application, interest often focuses on estimating the mode of a 
posterior distribution $f(\theta|x)$. Again, optimization can sometimes be simplified by
consideration of unobserved random variables $\psi$ in addition to the parameters of
interest, $\theta$.

The missing data may not truly be missing: They may be only a conceptual ploy 
that simplifies the problem. In this case, Z is often referred to as latent. It may seem 
counterintuitive that optimization sometimes can be simplified by introducing this 
new element into the problem. However, examples in this chapter and its references 
illustrate the potential benefit of this approach. In some cases, the analyst must draw 
upon his or her creativity and cleverness to invent effective latent variables; in other 
cases, there is a natural choice. 


4.1 MISSING DATA, MARGINALIZATION, AND NOTATION 


Whether Z is considered latent or missing, it may be viewed as having been removed
from the complete Y through the application of some many-to-fewer mapping,
$X = M(Y)$. Let $f_X(x|\theta)$ and $f_Y(y|\theta)$ denote the densities of the observed
data and the complete data, respectively. The latent- or missing-data assumption
amounts to a marginalization model in which we observe X having density
$f_X(x|\theta) = \int_{\{y:\, M(y) = x\}} f_Y(y|\theta)\, dy$. Note that the conditional density of the missing data given the
observed data is $f_{Z|X}(z|x, \theta) = f_Y(y|\theta)/f_X(x|\theta)$.

In Bayesian applications focusing on the posterior density for parameters of
interest, $\theta$, there are two manners in which we may consider the posterior to represent
a marginalization of a broader problem. First, it may be sensible to view the likelihood
$L(\theta|x)$ as a marginalization of the complete-data likelihood $L(\theta|y) = L(\theta|x, z)$. In this
case the missing data are z, and we use the same sort of notation as above. Second, we
may consider there to be missing parameters $\psi$, whose inclusion simplifies Bayesian
calculations even though $\psi$ is of no interest itself. Fortunately, under the Bayesian
paradigm there is no practical distinction between these two cases. Since Z and $\psi$ are
both missing random quantities, it matters little whether we use notation that suggests
the missing variables to be unobserved data or parameters. In cases where we adopt
the frequentist notation, the reader may replace the likelihood and Z by the posterior
and $\psi$, respectively, to consider the Bayesian point of view.

In the literature about EM, it is traditional to adopt notation that reverses the 
roles of X and Y compared to our usage. We diverge from tradition, using X = x to 
represent observed data as everywhere else in this book. 


4.2 THE EM ALGORITHM


The EM algorithm iteratively seeks to maximize $L(\theta|x)$ with respect to $\theta$. Let $\theta^{(t)}$
denote the estimated maximizer at iteration $t$, for $t = 0, 1, \ldots$. Define $Q(\theta|\theta^{(t)})$ to be
the expectation of the joint log likelihood for the complete data, conditional on the
observed data X = x. Namely,

$$
\begin{aligned}
Q(\theta|\theta^{(t)}) &= E\left\{\log L(\theta|Y) \mid x, \theta^{(t)}\right\} && \text{(4.1)} \\
&= E\left\{\log f_Y(y|\theta) \mid x, \theta^{(t)}\right\} && \text{(4.2)} \\
&= \int \left[\log f_Y(y|\theta)\right] f_{Z|X}(z|x, \theta^{(t)})\, dz, && \text{(4.3)}
\end{aligned}
$$

where (4.3) emphasizes that Z is the only random part of Y once we are given X = x.
EM is initiated from $\theta^{(0)}$ then alternates between two steps: E for expectation
and M for maximization. The algorithm is summarized as:


1. E step: Compute $Q(\theta|\theta^{(t)})$.
2. M step: Maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$. Set $\theta^{(t+1)}$ equal to the maximizer
of Q.
3. Return to the E step unless a stopping criterion has been met.
Stopping criteria for optimization problems are discussed in Chapter 2. In the
present case, such criteria are usually built upon $(\theta^{(t+1)} - \theta^{(t)})^T(\theta^{(t+1)} - \theta^{(t)})$ or
$|Q(\theta^{(t+1)}|\theta^{(t)}) - Q(\theta^{(t)}|\theta^{(t)})|$.


Example 4.1 (Simple Exponential Density) To understand the EM notation, consider
a trivial example where $Y_1, Y_2 \sim$ i.i.d. $\text{Exp}(\theta)$. Suppose $y_1 = 5$ is observed but
the value $y_2$ is missing. The complete-data log likelihood function is $\log L(\theta|y) =
\log f_Y(y|\theta) = 2\log\{\theta\} - \theta y_1 - \theta y_2$. Taking the conditional expectation of $\log L(\theta|Y)$
yields $Q(\theta|\theta^{(t)}) = 2\log\{\theta\} - 5\theta - \theta/\theta^{(t)}$, since $E\{Y_2|y_1, \theta^{(t)}\} = E\{Y_2|\theta^{(t)}\} = 1/\theta^{(t)}$
follows from independence. The maximizer of $Q(\theta|\theta^{(t)})$ with respect to $\theta$ is easily
found to be the root of $2/\theta - 5 - 1/\theta^{(t)} = 0$. Solving for $\theta$ provides the updating
equation $\theta^{(t+1)} = 2\theta^{(t)}/(5\theta^{(t)} + 1)$. Note here that the E step and M step do not need
to be rederived at each iteration: Iterative application of the updating formula starting
from some initial value provides estimates that converge to $\hat{\theta} = 0.2$.

This example is not realistic. The maximum likelihood estimate of $\theta$ from the
observed data can be determined from elementary analytic methods without reliance
on any fancy numerical optimization strategy like EM. More importantly, we will
learn that taking the required expectation is trickier in real applications, because one
needs to know the conditional distribution of the complete data given the observed
data.
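
The updating formula of Example 4.1 is simple enough to run directly. The sketch below keeps the E step and M step as separate lines to mirror the notation; the function name, starting value, and iteration count are arbitrary choices made for illustration.

```python
def em_exponential(theta0, y1=5.0, n_iter=50):
    """EM for Example 4.1: Y1, Y2 ~ i.i.d. Exp(theta), y1 observed, y2 missing."""
    theta = theta0
    for _ in range(n_iter):
        # E step: the only missing quantity needed is E{Y2 | theta} = 1/theta.
        expected_y2 = 1.0 / theta
        # M step: maximize 2*log(theta) - theta*(y1 + expected_y2) over theta.
        theta = 2.0 / (y1 + expected_y2)
    return theta


theta_hat = em_exponential(theta0=1.0)
```

The limit agrees with the observed-data MLE, which for a single exponential observation is 1/y1 = 0.2.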


Example 4.2 (Peppered Moths) The peppered moth, Biston betularia, presents 
a fascinating story of evolution and industrial pollution [276]. The coloring of these 
moths is believed to be determined by a single gene with three possible alleles, which 
we denote C, I, and T. Of these, C is dominant to I, and T is recessive to I. Thus the 
genotypes CC, CI, and CT result in the carbonaria phenotype, which exhibits solid 
black coloring. The genotype TT results in the typica phenotype, which exhibits light- 
colored patterned wings. The genotypes II and IT produce an intermediate phenotype 
called insularia, which varies widely in appearance but is generally mottled with 
intermediate color. Thus, there are six possible genotypes, but only three phenotypes 
are measurable in field work. 

In the United Kingdom and North America, the carbonaria phenotype nearly 
replaced the paler phenotypes in areas affected by coal-fired industries. This change 
in allele frequencies in the population is cited as an instance where we may observe 
microevolution occurring on a human time scale. The theory (supported by experi- 
ments) is that “differential predation by birds on moths that are variously conspicuous 
against backgrounds of different reflectance” (p. 88) induces selectivity that favors the 
carbonaria phenotype in times and regions where sooty, polluted conditions reduce 
the reflectance of the surface of tree bark on which the moths rest [276]. Not surpris- 
ingly, when improved environmental standards reduced pollution, the prevalence of 
the lighter-colored phenotypes increased and that of carbonaria plummeted. 

Thus, it is of interest to monitor the allele frequencies of C, I, and T over time 
to provide insight on microevolutionary processes. Further, trends in these frequen- 
cies also provide an interesting biological marker to monitor air quality. Within a 
sufficiently short time period, an approximate model for allele frequencies can be
built from the Hardy–Weinberg principle that each genotype frequency in a population
in Hardy–Weinberg equilibrium should equal the product of the corresponding
allele frequencies, or double that amount when the two alleles differ (to account for
uncertainty in the parental source) [15, 316]. Thus, if the allele frequencies in the
population are $p_C$, $p_I$, and $p_T$, then the genotype frequencies should be $p_C^2$, $2p_Cp_I$,
$2p_Cp_T$, $p_I^2$, $2p_Ip_T$, and $p_T^2$ for genotypes CC, CI, CT, II, IT, and TT, respectively.
Note that $p_C + p_I + p_T = 1$.

Suppose we capture $n$ moths, of which there are $n_C$, $n_I$, and $n_T$ of the carbonaria,
insularia, and typica phenotypes, respectively. Thus, $n = n_C + n_I + n_T$. Since each
moth has two alleles in the gene in question, there are $2n$ total alleles in the sample.
If we knew the genotype of each moth rather than merely its phenotype, we could
generate genotype counts $n_{CC}$, $n_{CI}$, $n_{CT}$, $n_{II}$, $n_{IT}$, and $n_{TT}$, from which allele
frequencies could easily be tabulated. For example, each moth with genotype CI contributes
one C allele and one I allele, whereas a II moth contributes two I alleles. Such allele
counts would immediately provide estimates of $p_C$, $p_I$, and $p_T$. It is far less clear how
to estimate the allele frequencies from the phenotype counts alone.

In the EM notation, the observed data are $x = (n_C, n_I, n_T)$ and the complete
data are $y = (n_{CC}, n_{CI}, n_{CT}, n_{II}, n_{IT}, n_{TT})$. The mapping from the complete data to the
observed data is $x = M(y) = (n_{CC} + n_{CI} + n_{CT},\; n_{II} + n_{IT},\; n_{TT})$. We wish to estimate
the allele probabilities, $p_C$, $p_I$, and $p_T$. Since $p_T = 1 - p_C - p_I$, the parameter vector
for this problem is $p = (p_C, p_I)$, but for notational brevity we often refer to $p_T$ in what
follows.

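The complete data log likelihood function is multinomial:

$$
\begin{aligned}
\log f_Y(y|p) ={}& n_{CC}\log\{p_C^2\} + n_{CI}\log\{2p_Cp_I\} + n_{CT}\log\{2p_Cp_T\} \\
&+ n_{II}\log\{p_I^2\} + n_{IT}\log\{2p_Ip_T\} + n_{TT}\log\{p_T^2\} \\
&+ \log\binom{n}{n_{CC}\;\, n_{CI}\;\, n_{CT}\;\, n_{II}\;\, n_{IT}\;\, n_{TT}}. && \text{(4.4)}
\end{aligned}
$$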

The complete data are not all observed. Let $Y = (N_{CC}, N_{CI}, N_{CT}, N_{II}, N_{IT}, n_{TT})$, since
we know $N_{TT} = n_T$ but the other frequencies are not directly observed. To calculate
$Q(p|p^{(t)})$, notice that conditional on $n_C$ and a parameter vector $p^{(t)} = (p_C^{(t)}, p_I^{(t)})$,
the latent counts for the three carbonaria genotypes have a three-cell multinomial
distribution with count parameter $n_C$ and cell probabilities proportional to $(p_C^{(t)})^2$,
$2p_C^{(t)}p_I^{(t)}$, and $2p_C^{(t)}p_T^{(t)}$. A similar result holds for the two insularia cells. Thus the
expected values of the first five random parts of (4.4) are

$$E\{N_{CC} \mid n_C, n_I, n_T, p^{(t)}\} = n_{CC}^{(t)} = \frac{n_C\,(p_C^{(t)})^2}{(p_C^{(t)})^2 + 2p_C^{(t)}p_I^{(t)} + 2p_C^{(t)}p_T^{(t)}}, \tag{4.5}$$

$$E\{N_{CI} \mid n_C, n_I, n_T, p^{(t)}\} = n_{CI}^{(t)} = \frac{2n_C\,p_C^{(t)}p_I^{(t)}}{(p_C^{(t)})^2 + 2p_C^{(t)}p_I^{(t)} + 2p_C^{(t)}p_T^{(t)}}, \tag{4.6}$$

$$E\{N_{CT} \mid n_C, n_I, n_T, p^{(t)}\} = n_{CT}^{(t)} = \frac{2n_C\,p_C^{(t)}p_T^{(t)}}{(p_C^{(t)})^2 + 2p_C^{(t)}p_I^{(t)} + 2p_C^{(t)}p_T^{(t)}}, \tag{4.7}$$

$$E\{N_{II} \mid n_C, n_I, n_T, p^{(t)}\} = n_{II}^{(t)} = \frac{n_I\,(p_I^{(t)})^2}{(p_I^{(t)})^2 + 2p_I^{(t)}p_T^{(t)}}, \tag{4.8}$$

$$E\{N_{IT} \mid n_C, n_I, n_T, p^{(t)}\} = n_{IT}^{(t)} = \frac{2n_I\,p_I^{(t)}p_T^{(t)}}{(p_I^{(t)})^2 + 2p_I^{(t)}p_T^{(t)}}. \tag{4.9}$$

Finally, we know $n_{TT}^{(t)} = n_T$, where $n_T$ is observed. The multinomial coefficient in the
likelihood has a conditional expectation, say $k(n_C, n_I, n_T, p^{(t)})$, that does not depend
on $p$. Thus, we have found


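$$
\begin{aligned}
Q(p|p^{(t)}) ={}& n_{CC}^{(t)}\log\{p_C^2\} + n_{CI}^{(t)}\log\{2p_Cp_I\} \\
&+ n_{CT}^{(t)}\log\{2p_Cp_T\} + n_{II}^{(t)}\log\{p_I^2\} \\
&+ n_{IT}^{(t)}\log\{2p_Ip_T\} + n_{TT}^{(t)}\log\{p_T^2\} + k(n_C, n_I, n_T, p^{(t)}). && \text{(4.10)}
\end{aligned}
$$

Recalling $p_T = 1 - p_C - p_I$ and differentiating with respect to $p_C$ and $p_I$ yields

$$\frac{dQ(p|p^{(t)})}{dp_C} = \frac{2n_{CC}^{(t)} + n_{CI}^{(t)} + n_{CT}^{(t)}}{p_C} - \frac{2n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{1 - p_C - p_I}, \tag{4.11}$$

$$\frac{dQ(p|p^{(t)})}{dp_I} = \frac{2n_{II}^{(t)} + n_{IT}^{(t)} + n_{CI}^{(t)}}{p_I} - \frac{2n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{1 - p_C - p_I}. \tag{4.12}$$

Setting these derivatives equal to zero and solving for $p_C$ and $p_I$ completes the M step,
yielding

$$p_C^{(t+1)} = \frac{2n_{CC}^{(t)} + n_{CI}^{(t)} + n_{CT}^{(t)}}{2n}, \tag{4.13}$$

$$p_I^{(t+1)} = \frac{2n_{II}^{(t)} + n_{IT}^{(t)} + n_{CI}^{(t)}}{2n}, \tag{4.14}$$

$$p_T^{(t+1)} = \frac{2n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{2n}, \tag{4.15}$$

where the final expression is derived from the constraint that the probabilities sum to
one. If the $t$th latent counts were true, the number of carbonaria alleles in the sample
would be $2n_{CC}^{(t)} + n_{CI}^{(t)} + n_{CT}^{(t)}$. There are $2n$ total alleles in the sample. Thus, the EM
update consists of setting the elements of $p^{(t+1)}$ equal to the phenotypic frequencies
that would result from the $t$th latent genotype counts.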

Suppose the observed phenotype counts are $n_C = 85$, $n_I = 196$, and $n_T = 341$.
Table 4.1 shows how the EM algorithm converges to the MLEs, roughly $\hat{p}_C =
0.07084$, $\hat{p}_I = 0.18874$, and $\hat{p}_T = 0.74043$. Finding a precise estimate of $p_I$ is slower
than for $p_C$, since the likelihood is flatter along the $p_I$ coordinate.

The last three columns of Table 4.1 show convergence diagnostics. A relative
convergence criterion,

$$R^{(t)} = \frac{\|p^{(t)} - p^{(t-1)}\|}{\|p^{(t-1)}\|}, \tag{4.16}$$

summarizes the total amount of relative change in $p^{(t)}$ from one iteration to
the next, where $\|z\| = (z^Tz)^{1/2}$. For illustrative purposes, we also include
$D_C^{(t)} = (p_C^{(t)} - \hat{p}_C)/(p_C^{(t-1)} - \hat{p}_C)$ and the analogous quantity $D_I^{(t)}$. These ratios quickly
converge to constants, confirming that the EM rate of convergence is linear as defined
by (2.19).


TABLE 4.1 EM results for peppered moth example. The diagnostic quantities $R^{(t)}$, $D_C^{(t)}$, and $D_I^{(t)}$ are
defined in the text.


t   p_C^(t)    p_I^(t)    R^(t)         D_C^(t)   D_I^(t)
0   0.333333   0.333333
1   0.081994   0.237406   5.7 × 10^-1   0.0425    0.337
2   0.071249   0.197870   1.6 × 10^-1   0.0369    0.188
3   0.070852   0.190360   3.6 × 10^-2   0.0367    0.178
4   0.070837   0.189023   6.6 × 10^-3   0.0367    0.176
5   0.070837   0.188787   1.2 × 10^-3   0.0367    0.176
6   0.070837   0.188745   2.1 × 10^-4   0.0367    0.176
7   0.070837   0.188738   3.6 × 10^-5   0.0367    0.176
8   0.070837   0.188737   6.4 × 10^-6   0.0367    0.176
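
The E step (4.5)-(4.9) and M step (4.13)-(4.14) for the peppered moth example translate almost line for line into code. The following is a minimal sketch; the function and variable names are illustrative, and the default counts and starting values are those quoted in the example.

```python
def moth_em(n_c=85, n_i=196, n_t=341, p_c=1 / 3, p_i=1 / 3, n_iter=8):
    """Return the EM iterates (p_C, p_I) for the peppered moth example."""
    n = n_c + n_i + n_t
    history = [(p_c, p_i)]
    for _ in range(n_iter):
        p_t = 1.0 - p_c - p_i
        # E step: expected latent genotype counts given the phenotype counts.
        denom_c = p_c ** 2 + 2 * p_c * p_i + 2 * p_c * p_t
        n_cc = n_c * p_c ** 2 / denom_c
        n_ci = n_c * 2 * p_c * p_i / denom_c
        n_ct = n_c * 2 * p_c * p_t / denom_c
        denom_i = p_i ** 2 + 2 * p_i * p_t
        n_ii = n_i * p_i ** 2 / denom_i
        n_it = n_i * 2 * p_i * p_t / denom_i
        # M step: allele frequencies implied by the expected genotype counts.
        p_c = (2 * n_cc + n_ci + n_ct) / (2 * n)
        p_i = (2 * n_ii + n_it + n_ci) / (2 * n)
        history.append((p_c, p_i))
    return history


history = moth_em()
```

Printing `history` row by row gives the iterates that can be compared with the second and third columns of Table 4.1.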


Example 4.3 (Bayesian Posterior Mode) Consider a Bayesian problem with likelihood
$L(\theta|x)$, prior $f(\theta)$, and missing data or parameters Z. To find the posterior
mode, the E step requires

$$
\begin{aligned}
Q(\theta|\theta^{(t)}) &= E\{\log\{L(\theta|Y)f(\theta)k(Y)\} \mid x, \theta^{(t)}\} \\
&= E\{\log L(\theta|Y) \mid x, \theta^{(t)}\} + \log f(\theta) + E\{\log k(Y) \mid x, \theta^{(t)}\}, && \text{(4.17)}
\end{aligned}
$$

where the final term in (4.17) is a normalizing constant that can be ignored because
Q is to be maximized with respect to $\theta$. This function Q is obtained by simply adding
the log prior to the Q function that would be used in a maximum likelihood setting.
Unfortunately, the addition of the log prior often makes it more difficult to maximize
Q during the M step. Section 4.3.2 describes a variety of methods for facilitating the
M step in difficult situations.


4.2.1 Convergence 


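To investigate the convergence properties of the EM algorithm, we begin by showing
that each maximization step increases the observed-data log likelihood, $l(\theta|x)$. First,
note that the log of the observed-data density can be reexpressed as

$$\log f_X(x|\theta) = \log f_Y(y|\theta) - \log f_{Z|X}(z|x, \theta). \tag{4.18}$$

Therefore,

$$E\{\log f_X(x|\theta) \mid x, \theta^{(t)}\} = E\{\log f_Y(y|\theta) \mid x, \theta^{(t)}\} - E\{\log f_{Z|X}(z|x, \theta) \mid x, \theta^{(t)}\},$$

where the expectations are taken with respect to the distribution of $Z|(x, \theta^{(t)})$. Thus,

$$\log f_X(x|\theta) = Q(\theta|\theta^{(t)}) - H(\theta|\theta^{(t)}), \tag{4.19}$$

where

$$H(\theta|\theta^{(t)}) = E\left\{\log f_{Z|X}(Z|x, \theta) \mid x, \theta^{(t)}\right\}. \tag{4.20}$$

The importance of (4.19) becomes apparent after we show that $H(\theta|\theta^{(t)})$ is
maximized with respect to $\theta$ when $\theta = \theta^{(t)}$. To see this, write

$$
\begin{aligned}
H(\theta^{(t)}|\theta^{(t)}) - H(\theta|\theta^{(t)})
&= E\left\{\log f_{Z|X}(Z|x, \theta^{(t)}) - \log f_{Z|X}(Z|x, \theta) \mid x, \theta^{(t)}\right\} \\
&= \int -\log\left[\frac{f_{Z|X}(z|x, \theta)}{f_{Z|X}(z|x, \theta^{(t)})}\right] f_{Z|X}(z|x, \theta^{(t)})\, dz \\
&\ge -\log \int f_{Z|X}(z|x, \theta)\, dz \\
&= 0. && \text{(4.21)}
\end{aligned}
$$

Equation (4.21) follows from an application of Jensen's inequality, since $-\log u$ is
strictly convex in $u$.

Thus, any $\theta \ne \theta^{(t)}$ makes $H(\theta|\theta^{(t)})$ smaller than $H(\theta^{(t)}|\theta^{(t)})$. In particular, if we
choose $\theta^{(t+1)}$ to maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$, then

$$\log f_X(x|\theta^{(t+1)}) - \log f_X(x|\theta^{(t)}) \ge 0, \tag{4.22}$$

since Q increases and H decreases, with strict inequality when $Q(\theta^{(t+1)}|\theta^{(t)}) >
Q(\theta^{(t)}|\theta^{(t)})$.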

Choosing $\theta^{(t+1)}$ at each iteration to maximize $Q(\theta|\theta^{(t)})$ with respect to $\theta$
constitutes the standard EM algorithm. If instead we simply select any $\theta^{(t+1)}$ for which
$Q(\theta^{(t+1)}|\theta^{(t)}) > Q(\theta^{(t)}|\theta^{(t)})$, then the resulting algorithm is called generalized EM, or
GEM. In either case, a step that increases Q increases the log likelihood. Conditions
under which this guaranteed ascent ensures convergence to an MLE are explored in
[60, 676].
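
The ascent property can be checked numerically on the toy problem of Example 4.1, where the observed-data log likelihood for the single observation $y_1 = 5$ is $\log\theta - 5\theta$. The sketch below is only a sanity check of the monotone increase for one starting value; the function names are illustrative.

```python
import math


def observed_loglik(theta, y1=5.0):
    # Observed-data log likelihood for one Exp(theta) observation y1.
    return math.log(theta) - theta * y1


def em_step(theta, y1=5.0):
    # E step: E{Y2 | theta} = 1/theta. M step: theta_new = 2 / (y1 + E{Y2 | theta}).
    return 2.0 / (y1 + 1.0 / theta)


def ascent_path(theta0, n_iter=15):
    """Observed-data log likelihood evaluated along the EM iterates."""
    thetas = [theta0]
    for _ in range(n_iter):
        thetas.append(em_step(thetas[-1]))
    return [observed_loglik(theta) for theta in thetas]


logliks = ascent_path(2.0)
```

Each entry of `logliks` should be no smaller than the one before it, and the sequence approaches the maximized value $\log(0.2) - 1$.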

Having established this result, we next consider the order of convergence for
the method. The EM algorithm defines a mapping $\theta^{(t+1)} = \Psi(\theta^{(t)})$ where the function
$\Psi(\theta) = (\Psi_1(\theta), \ldots, \Psi_p(\theta))$ and $\theta = (\theta_1, \ldots, \theta_p)$. When EM converges, it converges
to a fixed point of this mapping, so $\hat{\theta} = \Psi(\hat{\theta})$. Let $\Psi'(\theta)$ denote the Jacobian matrix
whose $(i, j)$th element is $d\Psi_i(\theta)/d\theta_j$. Taylor series expansion of $\Psi$ yields

$$\theta^{(t+1)} - \hat{\theta} \approx \Psi'(\theta^{(t)})(\theta^{(t)} - \hat{\theta}), \tag{4.23}$$

since $\theta^{(t+1)} - \hat{\theta} = \Psi(\theta^{(t)}) - \Psi(\hat{\theta})$. Comparing this result with (2.19), we see that the
EM algorithm has linear convergence when $p = 1$. For $p > 1$, convergence is still
linear provided that the observed information, $-l''(\hat{\theta}|x)$, is positive definite. More
precise details regarding convergence are given in [150, 449, 452, 455].
The global rate of EM convergence is defined as

$$\rho = \lim_{t \to \infty} \frac{\|\theta^{(t+1)} - \hat{\theta}\|}{\|\theta^{(t)} - \hat{\theta}\|}. \tag{4.24}$$

It can be shown that $\rho$ equals the largest eigenvalue of $\Psi'(\hat{\theta})$ when $-l''(\hat{\theta}|x)$ is
positive definite. In Sections 4.2.3.1 and 4.2.3.2 we will examine how $\Psi'(\hat{\theta})$ is a matrix
of the fractions of missing information. Therefore, $\rho$ effectively serves as a scalar
summary of the overall proportion of missing information. Conceptually, the
proportion of missing information equals one minus the ratio of the observed information
to the information that would be contained in the complete data. Thus, EM suffers
slower convergence when the proportion of missing information is larger. The linear
convergence of EM can be extremely slow compared to the quadratic convergence of,
say, Newton's method, particularly when the fraction of missing information is large.
However, the ease of implementation and the stable ascent of EM are often very
attractive despite its slow convergence. Section 4.3.3 discusses methods for accelerating
EM convergence.


FIGURE 4.1 One-dimensional illustration of EM algorithm as a minorization or optimization
transfer strategy.
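
For the toy problem of Example 4.1 the rate can be observed directly. There the complete-data information is $2/\theta^2$ and the observed-data information is $1/\theta^2$, so half of the information is missing, and the ratio of successive errors should approach 0.5. The sketch below estimates the ratio in (4.24) from the iterates; it assumes the limit $\hat{\theta} = 0.2$ is known, and the names are illustrative.

```python
def em_step(theta, y1=5.0):
    # One EM update for Example 4.1: theta_new = 2 * theta / (y1 * theta + 1).
    return 2.0 / (y1 + 1.0 / theta)


def error_ratios(theta0, theta_hat=0.2, n_iter=12):
    """Ratios |theta^(t+1) - theta_hat| / |theta^(t) - theta_hat| along the EM path."""
    ratios = []
    theta = theta0
    for _ in range(n_iter):
        new_theta = em_step(theta)
        ratios.append(abs(new_theta - theta_hat) / abs(theta - theta_hat))
        theta = new_theta
    return ratios


ratios = error_ratios(1.0)
```

The early ratios are smaller than the limit because the linear approximation (4.23) is only accurate near $\hat{\theta}$.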
To further understand how EM works, note from (4.21) that

$$l(\theta|x) \ge Q(\theta|\theta^{(t)}) + l(\theta^{(t)}|x) - Q(\theta^{(t)}|\theta^{(t)}) = G(\theta|\theta^{(t)}). \tag{4.25}$$

Since the last two terms in $G(\theta|\theta^{(t)})$ are independent of $\theta$, the functions Q and G are
maximized at the same $\theta$. Further, G is tangent to $l$ at $\theta^{(t)}$ and lies everywhere below $l$.
We say that G is a minorizing function for $l$. The EM strategy transfers optimization
from $l$ to the surrogate function G (effectively to Q), which is more convenient to
maximize. The maximizer of G provides an increase in $l$. This idea is illustrated in
Figure 4.1. Each E step amounts to forming the minorizing function G, and each
M step amounts to maximizing it to provide an uphill step.

Temporarily replacing $l$ by a minorizing function is an example of a more
general strategy known as optimization transfer. Links to the EM algorithm and other
statistical applications of optimization transfer are surveyed in [410]. In mathematical
applications where it is standard to pose optimizations as minimizations, one typically
refers to majorization, as one could achieve by majorizing the negative log likelihood
using $-G(\theta|\theta^{(t)})$.


4.2.2 Usage in Exponential Families


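When the complete data are modeled to have a distribution in the exponential family,
the density of the data can be written as $f(\mathbf{y} \mid \theta) = c_1(\mathbf{y})\, c_2(\theta) \exp\{\theta^T s(\mathbf{y})\}$, where $\theta$ is
a vector of natural parameters and $s(\mathbf{y})$ is a vector of sufficient statistics. In this case,
the E step finds

$$Q(\theta \mid \theta^{(t)}) = k + \log c_2(\theta) + \int \theta^T s(\mathbf{y})\, f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})\, d\mathbf{z}, \tag{4.26}$$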


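where k is a quantity that does not depend on $\theta$. To carry out the M step, set the
gradient of $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$ equal to zero. This yields

$$\frac{-c_2'(\theta)}{c_2(\theta)} = \int s(\mathbf{y})\, f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})\, d\mathbf{z} \tag{4.27}$$

after rearranging terms and adopting the obvious notational shortcut to vectorize
the integral of a vector. It is straightforward to show that $c_2'(\theta) = -c_2(\theta)\, E\{s(\mathbf{Y}) \mid \theta\}$.
Therefore, (4.27) implies that the M step is completed by setting $\theta^{(t+1)}$ equal to the $\theta$
that solves

$$E\{s(\mathbf{Y}) \mid \theta\} = \int s(\mathbf{y})\, f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})\, d\mathbf{z}. \tag{4.28}$$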


Aside from replacing $\theta^{(t)}$ with $\theta^{(t+1)}$, the form of $Q(\theta \mid \theta^{(t)})$ is unchanged for the next
E step, and the next M step solves the same optimization problem. Therefore, the EM
algorithm for exponential families consists of:


1. E step: Compute the expected values of the sufficient statistics for the complete
data, given the observed data and using the current parameter guesses, $\theta^{(t)}$. Let
$s^{(t)} = E\{s(\mathbf{Y}) \mid \mathbf{x}, \theta^{(t)}\} = \int s(\mathbf{y})\, f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})\, d\mathbf{z}$.


2. M step: Set $\theta^{(t+1)}$ to the value that makes the unconditional expectation of the
sufficient statistics for the complete data equal to $s^{(t)}$. In other words, $\theta^{(t+1)}$
solves $E\{s(\mathbf{Y}) \mid \theta\} = s^{(t)}$.


3. Return to the E step unless a convergence criterion has been met.
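
The three steps above can be sketched for a simple case. The following is a minimal illustration, not taken from the text, assuming right-censored Exp($\lambda$) data, for which the complete-data sufficient statistic is $\sum_i Y_i$ and its unconditional expectation is $n/\lambda$. The function name `em_censored_exponential`, the starting value, the tolerance, and the simulated data are all hypothetical choices.

```python
import numpy as np

def em_censored_exponential(times, observed, lam0=1.0, tol=1e-10, max_iter=10_000):
    """EM for right-censored Exp(lam) data via expected sufficient statistics."""
    times = np.asarray(times, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n = times.size
    n_censored = int((~observed).sum())
    total_time = times.sum()
    lam = lam0
    for _ in range(max_iter):
        # E step: expected value of the sufficient statistic sum(Y_i).
        # A censored case contributes c_i + 1/lam (memoryless property).
        s = total_time + n_censored / lam
        # M step: solve E{sum(Y_i) | lam} = n / lam = s for lam.
        lam_new = n / s
        # Step 3: stop once the iterates have stabilized.
        if abs(lam_new - lam) < tol:
            return lam_new
        lam = lam_new
    return lam

# Synthetic illustration (simulated data, not from the text).
rng = np.random.default_rng(0)
y = rng.exponential(scale=1 / 0.5, size=200)   # complete data, true rate 0.5
c = rng.exponential(scale=3.0, size=200)       # hypothetical censoring levels
times = np.minimum(y, c)                       # observed times
observed = y <= c                              # True when uncensored
lam_hat = em_censored_exponential(times, observed)
```

For this model the fixed point of the iteration is the number of uncensored cases divided by the total observed time, which gives a convenient check on the sketch.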


Example 4.4 (Peppered Moths, Continued) The complete data in Example 4.2
arise from a multinomial distribution, which is in the exponential family. The sufficient
statistics are, say, the first five genotype counts (with the sixth derived from the
constraint that the counts total n), and the natural parameters are the corresponding
log probabilities seen in (4.4). The first three conditional expectations for the E step
are $s_{CC}^{(t)} = n_{CC}^{(t)}$, $s_{CI}^{(t)} = n_{CI}^{(t)}$, and $s_{CT}^{(t)} = n_{CT}^{(t)}$, borrowing notation from (4.5)-(4.9) and
indexing the components of $s^{(t)}$ in the obvious way. The unconditional expectations
of the first three sufficient statistics are $n p_C^2$, $2 n p_C p_I$, and $2 n p_C p_T$. Equating these
three expressions with the conditional expectations given above and solving for $p_C$
constitutes the M step for $p_C$. Summing the three equations gives
$n p_C^2 + 2 n p_C p_I + 2 n p_C p_T = n_{CC}^{(t)} + n_{CI}^{(t)} + n_{CT}^{(t)}$, which reduces to the update given in (4.13). EM updates
for $p_I$ and $p_T$ are found analogously, on noting the constraint that the three probabilities
sum to 1.


4.2.3 Variance Estimation 


In a maximum likelihood setting, the EM algorithm is used to find an MLE but does not
automatically produce an estimate of the covariance matrix of the MLEs. Typically,
we would use the asymptotic normality of the MLEs to justify seeking an estimate of
the Fisher information matrix. One way to estimate the covariance matrix, therefore,
is to compute the observed information, $-l''(\hat\theta \mid \mathbf{x})$, where $l''$ is the Hessian matrix of
second derivatives of $\log L(\theta \mid \mathbf{x})$.

In a Bayesian setting, an estimate of the posterior covariance matrix for $\theta$ can
be motivated by noting the asymptotic normality of the posterior [221]. This requires 
the Hessian of the log posterior density. 

In some cases, the Hessian may be computed analytically. In other cases, the 
Hessian may be difficult to derive or code. In these instances, a variety of other 
methods are available to simplify the estimation of the covariance matrix. 

Of the options described below, the SEM (supplemented EM) algorithm is easy 
to implement while generally providing fast, reliable results. Even easier is boot- 
strapping, although for very complex problems the computational burden of nested 
looping may be prohibitive. These two approaches are recommended, yet the other 
alternatives can also be useful in some settings. 


4.2.3.1 Louis’s Method Taking second partial derivatives of (4.19) and
negating both sides yields

$$-l''(\theta \mid \mathbf{x}) = -Q''(\theta \mid \omega)\big|_{\omega=\theta} + H''(\theta \mid \omega)\big|_{\omega=\theta}, \tag{4.29}$$

where the primes on $Q''$ and $H''$ denote derivatives with respect to the first argument,
namely $\theta$.
Equation (4.29) can be rewritten as

$$\hat{i}_X(\theta) = \hat{i}_Y(\theta) - \hat{i}_{Z|X}(\theta), \tag{4.30}$$

where $\hat{i}_X(\theta) = -l''(\theta \mid \mathbf{x})$ is the observed information, and $\hat{i}_Y(\theta)$ and $\hat{i}_{Z|X}(\theta)$ will be
called the complete information and the missing information, respectively. Interchanging
integration and differentiation (when possible), we have

$$\hat{i}_Y(\theta) = -Q''(\theta \mid \omega)\big|_{\omega=\theta} = -E\{l''(\theta \mid \mathbf{Y}) \mid \mathbf{x}, \theta\}, \tag{4.31}$$

which is reminiscent of the Fisher information defined in (1.28). This motivates calling
$\hat{i}_Y(\theta)$ the complete information. A similar argument holds for $-H''$. Equation (4.30),
stating that the observed information equals the complete information minus the
missing information, is a result termed the missing-information principle [424, 673].

The missing-information principle can be used to obtain an estimated covariance
matrix for $\hat\theta$. It can be shown that

$$\hat{i}_{Z|X}(\theta) = \operatorname{var}\left\{ \frac{d \log f_{Z|X}(\mathbf{Z} \mid \mathbf{x}, \theta)}{d\theta} \right\}, \tag{4.32}$$

where the variance is taken with respect to $f_{Z|X}$. Further, since the expected score is
zero at $\hat\theta$,

$$\hat{i}_{Z|X}(\hat\theta) = \int S_{Z|X}(\hat\theta)\, S_{Z|X}(\hat\theta)^T f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \hat\theta)\, d\mathbf{z}, \tag{4.33}$$

where

$$S_{Z|X}(\theta) = \frac{d \log f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta)}{d\theta}.$$


The missing-information principle enables us to express $\hat{i}_X(\theta)$ in terms of the
complete-data likelihood and the conditional density of the missing data given the
observed data, while avoiding calculations involving the presumably complicated
marginal likelihood of the observed data. This approach can be easier to derive and
code in some instances, but it is not always significantly simpler than direct calculation
of $-l''(\theta \mid \mathbf{x})$.

If $\hat{i}_Y(\theta)$ or $\hat{i}_{Z|X}(\theta)$ is difficult to compute analytically, it may be estimated via
the Monte Carlo method (see Chapter 6). For example, the simplest Monte Carlo
estimate of $\hat{i}_Y(\theta)$ is

$$\frac{1}{m} \sum_{i=1}^{m} -\frac{d^2 \log f_Y(\mathbf{y}_i \mid \theta)}{d\theta \cdot d\theta}, \tag{4.34}$$

where for $i = 1, \ldots, m$, the $\mathbf{y}_i = (\mathbf{x}, \mathbf{z}_i)$ are simulated complete datasets consisting of
the observed data and i.i.d. imputed missing-data values $\mathbf{z}_i$ drawn from $f_{Z|X}$. Similarly,
a simple Monte Carlo estimate of $\hat{i}_{Z|X}(\theta)$ is the sample variance of the values of

$$-\frac{d \log f_{Z|X}(\mathbf{z}_i \mid \mathbf{x}, \theta)}{d\theta}$$

obtained from such a collection of $\mathbf{z}_i$.


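Example 4.5 (Censored Exponential Data) Suppose we attempt to observe
complete data under the model $Y_1, \ldots, Y_n \sim$ i.i.d. Exp($\lambda$), but some cases are right-censored.
Thus, the observed data are $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ where $\mathbf{x}_i = (\min(y_i, c_i), \delta_i)$,
the $c_i$ are the censoring levels, and $\delta_i = 1$ if $y_i \le c_i$ and $\delta_i = 0$ otherwise.

The complete-data log likelihood is $l(\lambda \mid y_1, \ldots, y_n) = n \log \lambda - \lambda \sum_{i=1}^{n} y_i$.
Thus,

$$Q(\lambda \mid \lambda^{(t)}) = E\{l(\lambda \mid Y_1, \ldots, Y_n) \mid \mathbf{x}, \lambda^{(t)}\} \tag{4.35}$$

$$= n \log \lambda - \lambda \sum_{i=1}^{n} E\{Y_i \mid \mathbf{x}_i, \lambda^{(t)}\}$$

$$= n \log \lambda - \lambda \sum_{i=1}^{n} \left[ y_i \delta_i + \left( c_i + \frac{1}{\lambda^{(t)}} \right) (1 - \delta_i) \right] \tag{4.36}$$

$$= n \log \lambda - \lambda \sum_{i=1}^{n} \left[ y_i \delta_i + c_i (1 - \delta_i) \right] - \frac{C \lambda}{\lambda^{(t)}}, \tag{4.37}$$

where $C = \sum_{i=1}^{n} (1 - \delta_i)$ denotes the number of censored cases. Note that (4.36)
follows from the memoryless property of the exponential distribution. Therefore,
$-Q''(\lambda \mid \lambda^{(t)}) = n / \lambda^2$.

The unobserved outcome for a censored case, $Z_i$, has density
$f_{Z_i|X}(z_i \mid \mathbf{x}, \lambda) = \lambda \exp\{-\lambda (z_i - c_i)\}\, 1_{\{z_i > c_i\}}$. Calculating $\hat{i}_{Z|X}(\lambda)$ as in (4.32), we find

$$\frac{d \log f_{Z|X}(\mathbf{Z} \mid \mathbf{x}, \lambda)}{d\lambda} = \frac{C}{\lambda} - \sum_{\{i:\, \delta_i = 0\}} (Z_i - c_i). \tag{4.38}$$

The variance of this expression with respect to $f_{Z_i|X}$ is

$$\hat{i}_{Z|X}(\lambda) = \sum_{\{i:\, \delta_i = 0\}} \operatorname{var}\{Z_i - c_i\} = \frac{C}{\lambda^2}, \tag{4.39}$$

since $Z_i - c_i$ has an Exp($\lambda$) distribution.
Thus, applying Louis’s method,

$$\hat{i}_X(\lambda) = \frac{n}{\lambda^2} - \frac{C}{\lambda^2} = \frac{U}{\lambda^2}, \tag{4.40}$$

where $U = \sum_{i=1}^{n} \delta_i$ denotes the number of uncensored cases. For this elementary
example, it is easy to confirm by direct analysis that $-l''(\lambda \mid \mathbf{x}) = U / \lambda^2$.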
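
The calculation in this example can be checked numerically. The sketch below is an illustration under assumed inputs: the counts `n` and `n_censored` and the rate `lam` are invented, and the missing information is estimated by Monte Carlo as the sample variance of the conditional score in (4.38), then compared with the closed form in (4.39).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical summary of a censored sample (values invented for illustration).
n, n_censored = 30, 12
n_uncensored = n - n_censored
lam = 0.25                       # rate at which the information is evaluated

# Complete information from -Q''(lam | lam) = n / lam^2.
i_complete = n / lam**2

# Missing information by Monte Carlo: sample variance of the conditional score
# C/lam - sum(Z_i - c_i), where each Z_i - c_i ~ Exp(lam) for a censored case.
m = 20_000
excess = rng.exponential(scale=1 / lam, size=(m, n_censored))
scores = n_censored / lam - excess.sum(axis=1)
i_missing_mc = scores.var(ddof=1)

# Closed forms from (4.39) and (4.40).
i_missing_exact = n_censored / lam**2
i_observed = i_complete - i_missing_exact       # equals U / lam^2
i_observed_mc = i_complete - i_missing_mc       # Monte Carlo version
```

Because only the missing-information term is simulated, the Monte Carlo error enters the observed information through a single subtraction; a larger `m` would tighten the agreement.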


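4.2.3.2 SEM Algorithm Recall that $\Psi$ denotes the EM mapping, having fixed
point $\hat\theta$ and Jacobian matrix $\Psi'(\theta)$ with (i, j)th element equaling $d\Psi_i(\theta)/d\theta_j$. Dempster
et al. [150] show that

$$\Psi'(\hat\theta)^T = \hat{i}_{Z|X}(\hat\theta)\, \hat{i}_Y(\hat\theta)^{-1} \tag{4.41}$$

in the terminology of (4.30).
If we reexpress the missing information principle in (4.30) as

$$\hat{i}_X(\hat\theta) = \left[ I - \hat{i}_{Z|X}(\hat\theta)\, \hat{i}_Y(\hat\theta)^{-1} \right] \hat{i}_Y(\hat\theta), \tag{4.42}$$

where I is an identity matrix, and substitute (4.41) into (4.42), then inverting $\hat{i}_X(\hat\theta)$
provides the estimate

$$\widehat{\operatorname{var}}\{\hat\theta\} = \hat{i}_Y(\hat\theta)^{-1} \left( I + \Psi'(\hat\theta)^T \left[ I - \Psi'(\hat\theta)^T \right]^{-1} \right). \tag{4.43}$$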


This result is appealing in that it expresses the desired covariance matrix as the 
complete-data covariance matrix plus an incremental matrix that takes account of 
the uncertainty attributable to the missing data. When coupled with the following 
numerical differentiation strategy to estimate the increment, Meng and Rubin have 
termed this approach the supplemented EM (SEM) algorithm [453]. Since numerical 
imprecisions in the differentiation approach affect only the estimated increment, es- 
timation of the covariance matrix is typically more stable than the generic numerical 
differentiation approach described in Section 4.2.3.5. 

Estimation of $\Psi'(\hat\theta)$ proceeds as follows. The first step of SEM is to run the EM
algorithm to convergence, finding the maximizer $\hat\theta$. The second step is to restart the
algorithm from, say, $\theta^{(0)}$. Although one may restart from the original starting point,
it is preferable to choose $\theta^{(0)}$ to be closer to $\hat\theta$.

Having thus initialized SEM, we begin SEM iterations for $t = 0, 1, 2, \ldots$. The
(t + 1)th SEM iteration begins by taking a standard E step and M step to produce $\theta^{(t+1)}$
from $\theta^{(t)}$. Next, for $j = 1, \ldots, p$, define $\theta^{(t)}(j) = (\hat\theta_1, \ldots, \hat\theta_{j-1}, \theta_j^{(t)}, \hat\theta_{j+1}, \ldots, \hat\theta_p)$
and

$$r_{ij}^{(t)} = \frac{\Psi_i(\theta^{(t)}(j)) - \hat\theta_i}{\theta_j^{(t)} - \hat\theta_j} \tag{4.44}$$

for $i = 1, \ldots, p$, recalling that $\Psi(\hat\theta) = \hat\theta$. This ends one SEM iteration. The $\Psi_i(\theta^{(t)}(j))$
values are the estimates produced by applying one EM cycle to $\theta^{(t)}(j)$ for $j = 1, \ldots, p$.

Notice that the (i, j)th element of $\Psi'(\hat\theta)$ equals $\lim_{t \to \infty} r_{ij}^{(t)}$. We may consider
each element of this matrix to be precisely estimated when the sequence of $r_{ij}^{(t)}$ values
stabilizes for $t \ge t_{ij}^*$. Note that different numbers of iterations may be needed for
precise estimation of different elements of $\Psi'(\hat\theta)$. When all elements have stabilized,
SEM iterations stop and the resulting estimate of $\Psi'(\hat\theta)$ is used to determine $\widehat{\operatorname{var}}\{\hat\theta\}$ as
given in (4.43).
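
To make the SEM iteration concrete, here is a minimal one-parameter sketch (so p = 1 and the matrices in (4.43) and (4.44) are scalars). It assumes the right-censored exponential model, for which one EM cycle has a closed form; the function `em_map`, the summary values, the restart point, and the stabilization tolerance are all hypothetical choices rather than anything prescribed in the text.

```python
import math

def em_map(lam, total_time, n, n_censored):
    """One EM cycle Psi(lam) for right-censored exponential data."""
    return n / (total_time + n_censored / lam)

# Hypothetical summaries: n cases, C censored, total observed time sum(x_i).
n, n_censored, total_time = 30, 12, 80.0
n_uncensored = n - n_censored

# Step 1: run EM to convergence to obtain lam_hat.
lam_hat = 1.0
for _ in range(2000):
    lam_hat = em_map(lam_hat, total_time, n, n_censored)

# Step 2: restart near lam_hat and track r^(t), the scalar case of (4.44).
lam_t = 1.1 * lam_hat
r_prev, r = math.inf, math.nan
for _ in range(100):
    lam_next = em_map(lam_t, total_time, n, n_censored)
    r = (lam_next - lam_hat) / (lam_t - lam_hat)
    if abs(r - r_prev) < 1e-6:       # the sequence r^(t) has stabilized
        break
    r_prev, lam_t = r, lam_next

# Scalar version of (4.43); the complete-information inverse is lam_hat^2 / n.
var_sem = (lam_hat**2 / n) * (1 + r / (1 - r))
```

In this model the limit of `r` is C/n and the SEM variance agrees with the inverse observed information, $\hat\lambda^2 / U$, which is a useful sanity check. Iterating much past stabilization is not advisable, since the ratio eventually becomes a quotient of two quantities near machine precision.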

Numerical imprecision can cause the resulting covariance matrix to be slightly
asymmetric. Such asymmetry can be used to diagnose whether the original EM procedure
was run to sufficient precision and to assess how many digits are trustworthy
in entries of the estimated covariance matrix. Difficulties also arise if $I - \Psi'(\hat\theta)^T$ is
not positive semidefinite or cannot be inverted numerically; see [453]. It has been
suggested that transforming $\theta$ to achieve an approximately normal likelihood can
lead to faster convergence and increased accuracy of the final solution.


Example 4.6 (Peppered Moths, Continued) The results from Example 4.2 can
be supplemented using the approach of Meng and Rubin [453]. Stable, precise
results are obtained within a few SEM iterations, starting from $p_C^{(0)} = 0.07$ and
$p_I^{(0)} = 0.19$. Standard errors for $\hat{p}_C$, $\hat{p}_I$, and $\hat{p}_T$ are 0.0074, 0.0119, and 0.0132,
respectively. Pairwise correlations are $\operatorname{cor}\{\hat{p}_C, \hat{p}_I\} = -0.14$, $\operatorname{cor}\{\hat{p}_C, \hat{p}_T\} = -0.44$,
and $\operatorname{cor}\{\hat{p}_I, \hat{p}_T\} = -0.83$. Here, SEM was used to obtain results for $\hat{p}_C$ and $\hat{p}_I$, and
elementary relationships among variances, covariances, and correlations were used
to extend these results for $\hat{p}_T$, since the estimated probabilities sum to 1.


It may seem inefficient not to begin SEM iterations until EM iterations have
ceased. An alternative would be to attempt to estimate the components of $\Psi'(\hat\theta)$ as
EM iterations progress, using

$$r_{ij}^{(t)} = \frac{\Psi_i\left(\theta_1^{(t-1)}, \ldots, \theta_{j-1}^{(t-1)}, \theta_j^{(t)}, \theta_{j+1}^{(t-1)}, \ldots, \theta_p^{(t-1)}\right) - \Psi_i\left(\theta^{(t-1)}\right)}{\theta_j^{(t)} - \theta_j^{(t-1)}}. \tag{4.45}$$

However, Meng and Rubin [453] argue that this approach will not require fewer
iterations overall, that the extra steps required to find $\hat\theta$ first can be offset by starting
SEM closer to $\hat\theta$, and that the alternative is numerically less stable. Jamshidian and
Jennrich survey a variety of methods for numerically differentiating $\Psi$ or $l'$ itself,
including some they consider superior to SEM [345].


4.2.3.3 Bootstrapping Thorough discussion of bootstrapping is given in Chapter 9.
In its simplest implementation, bootstrapping to obtain an estimated covariance
matrix for EM would proceed as follows for i.i.d. observed data $\mathbf{x}_1, \ldots, \mathbf{x}_n$:

1. Calculate $\hat\theta_{EM}$ using a suitable EM approach applied to $\mathbf{x}_1, \ldots, \mathbf{x}_n$. Let $j = 1$
and set $\hat\theta_j = \hat\theta_{EM}$.

2. Increment j. Sample pseudo-data $\mathbf{X}_1^*, \ldots, \mathbf{X}_n^*$ completely at random from
$\mathbf{x}_1, \ldots, \mathbf{x}_n$ with replacement.

3. Calculate $\hat\theta_j$ by applying the same EM approach to the pseudo-data $\mathbf{X}_1^*, \ldots, \mathbf{X}_n^*$.

4. Stop if j is large enough; otherwise return to step 2.


For most problems, a few thousand iterations will suffice. At the end of the process, we
have generated a collection of parameter estimates, $\hat\theta_1, \ldots, \hat\theta_B$, where B denotes the
total number of iterations used. Then the sample variance of these B estimates is the
estimated variance of $\hat\theta$. Conveniently, other aspects of the sampling distribution of $\hat\theta$,
such as correlations and quantiles, can be estimated using the corresponding sample
estimates based on $\hat\theta_1, \ldots, \hat\theta_B$. Note that bootstrapping embeds the EM loop in a
second loop of B iterations. This nested looping can be computationally burdensome
when the solution of each EM problem is slow because of a high proportion of missing
data or high dimensionality.
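
The nested looping described above can be sketched as follows. This is an illustrative example under stated assumptions: the data are simulated right-censored exponential observations, `em_rate` is a hypothetical stand-in for "a suitable EM approach", and B = 1000 is an arbitrary choice. Each case (observed time together with its censoring indicator) is resampled as a unit.

```python
import numpy as np

def em_rate(times, observed, lam0=1.0, n_iter=200):
    """Hypothetical EM routine: rate estimate for right-censored exponential data."""
    n, n_cens = times.size, int((~observed).sum())
    total_time = times.sum()
    lam = lam0
    for _ in range(n_iter):
        lam = n / (total_time + n_cens / lam)    # E and M steps combined
    return lam

rng = np.random.default_rng(2)
n = 200
y = rng.exponential(scale=2.0, size=n)           # synthetic data, true rate 0.5
c = rng.exponential(scale=4.0, size=n)           # synthetic censoring levels
times, observed = np.minimum(y, c), y <= c

# Step 1: estimate from the original data; this is the first of the B estimates.
estimates = [em_rate(times, observed)]
B = 1000
# Steps 2-4: resample cases with replacement and rerun the same EM routine.
while len(estimates) < B:
    idx = rng.integers(0, n, size=n)
    estimates.append(em_rate(times[idx], observed[idx]))

boot_se = np.std(estimates, ddof=1)              # bootstrap standard error
```

For a vector parameter, `np.cov` applied to the stacked estimates would give the covariance matrix in the same way. In this simple model the bootstrap standard error can be compared with the asymptotic value $\hat\lambda / \sqrt{U}$.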


4.2.3.4 Empirical Information When the data are i.i.d., note that the score
function is the sum of individual scores for each observation:

$$\frac{d \log f_X(\mathbf{x} \mid \theta)}{d\theta} = l'(\theta \mid \mathbf{x}) = \sum_{i=1}^{n} l'(\theta \mid \mathbf{x}_i), \tag{4.46}$$

where we write the observed dataset as $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$. Since the Fisher information
matrix is defined to be the variance of the score function, this suggests estimating
the information using the sample variance of the individual scores. The empirical
information is defined as

$$\frac{1}{n} \sum_{i=1}^{n} l'(\theta \mid \mathbf{x}_i)\, l'(\theta \mid \mathbf{x}_i)^T - \frac{1}{n^2}\, l'(\theta \mid \mathbf{x})\, l'(\theta \mid \mathbf{x})^T. \tag{4.47}$$

This estimate has been discussed in the EM context in [450, 530]. The appeal of this
approach is that all the terms in (4.47) are by-products of the M step: No additional
analysis is required. To see this, note that $\theta^{(t)}$ maximizes $Q(\theta \mid \theta^{(t)}) - l(\theta \mid \mathbf{x})$ with
respect to $\theta$. Therefore, taking derivatives with respect to $\theta$,

$$Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} = l'(\theta \mid \mathbf{x})\big|_{\theta = \theta^{(t)}}. \tag{4.48}$$

Since $Q'$ is ordinarily calculated at each M step, the individual terms in (4.47) are
available.


4.2.3.5 Numerical Differentiation To estimate the Hessian, consider computing
the numerical derivative of $l'$ at $\hat\theta$, one coordinate at a time, using (1.10). The first
row of the estimated Hessian can be obtained by adding a small perturbation to the
first coordinate of $\hat\theta$, then computing the ratio of the difference between $l'(\theta)$ at $\theta = \hat\theta$
and at the perturbed value, relative to the magnitude of the perturbation. The remaining
rows of the Hessian are approximated similarly. If a perturbation is too small,
estimated partial derivatives may be inaccurate due to roundoff error; if a perturbation
is too big, the estimates may also be inaccurate. Such numerical differentiation
can be tricky to automate, especially when the components of $\theta$ have different scales.
More sophisticated numerical differentiation strategies are surveyed in [345].
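
A row-by-row difference scheme of the kind just described might look like the sketch below. It is only an illustration: the helper `numerical_hessian`, the default perturbation `eps`, the rule for scaling the perturbation to each coordinate, and the censored-exponential score used to exercise it are all assumptions, not recommendations from the text.

```python
import numpy as np

def numerical_hessian(score, theta_hat, eps=1e-5):
    """Forward-difference Hessian built row by row from a score function."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = theta_hat.size
    base = np.asarray(score(theta_hat), dtype=float)
    hessian = np.empty((p, p))
    for i in range(p):
        # Scale the perturbation to the size of the coordinate being perturbed.
        h = eps * max(1.0, abs(theta_hat[i]))
        shifted = theta_hat.copy()
        shifted[i] += h
        # Row i: change in the score per unit change in coordinate i.
        hessian[i] = (np.asarray(score(shifted), dtype=float) - base) / h
    return hessian

# Hypothetical example: censored exponential score l'(lam) = U/lam - sum(x_i).
U, total_time = 18, 80.0
score = lambda lam: np.array([U / lam[0] - total_time])
lam_hat = np.array([U / total_time])
observed_info = -numerical_hessian(score, lam_hat)
```

Here the result can be compared with the exact observed information $U / \hat\lambda^2$. In several dimensions the estimate is generally not exactly symmetric, so averaging it with its transpose is a common follow-up step.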


4.3 EM VARIANTS 


4.3.1 Improving the E Step 


The E step requires finding the expected log likelihood of the complete data
conditional on the observed data. We have denoted this expectation as $Q(\theta \mid \theta^{(t)})$. When
this expectation is difficult to compute analytically, it can be approximated via Monte
Carlo (see Chapter 6).


4.3.1.1 Monte Carlo EM Wei and Tanner [656] propose that the tth E step can
be replaced with the following two steps:


1. Draw missing datasets $\mathbf{Z}_1^{(t)}, \ldots, \mathbf{Z}_{m^{(t)}}^{(t)}$ i.i.d. from $f_{Z|X}(\mathbf{z} \mid \mathbf{x}, \theta^{(t)})$. Each $\mathbf{Z}_j^{(t)}$ is a
vector of all the missing values needed to complete the observed dataset, so
$\mathbf{Y}_j^{(t)} = (\mathbf{x}, \mathbf{Z}_j^{(t)})$ denotes a completed dataset where the missing values have been
replaced by $\mathbf{Z}_j^{(t)}$.


2. Calculate $\hat{Q}^{(t+1)}(\theta \mid \theta^{(t)}) = \frac{1}{m^{(t)}} \sum_{j=1}^{m^{(t)}} \log f_Y(\mathbf{Y}_j^{(t)} \mid \theta)$.


Then $\hat{Q}^{(t+1)}(\theta \mid \theta^{(t)})$ is a Monte Carlo estimate of $Q(\theta \mid \theta^{(t)})$. The M step is modified to
maximize $\hat{Q}^{(t+1)}(\theta \mid \theta^{(t)})$.

The recommended strategy is to let $m^{(t)}$ be small during early EM iterations and
to increase $m^{(t)}$ as iterations progress to reduce the Monte Carlo variability introduced
in $\hat{Q}$. Nevertheless, this Monte Carlo EM algorithm (MCEM) will not converge in
the same sense as ordinary EM. As iterations proceed, values of $\theta^{(t)}$ will eventually
bounce around the true maximum, with a precision that depends on $m^{(t)}$. Discussion
of the asymptotic convergence properties of MCEM is provided in [102]. A stochastic
alternative to MCEM is discussed in [149].


Example 4.7 (Censored Exponential Data, Continued) In Example 4.5, it was
easy to compute the conditional expectation of $l(\lambda \mid \mathbf{Y}) = n \log \lambda - \lambda \sum_{i=1}^{n} Y_i$ given
the observed data. The result, given in (4.37), can be maximized to provide the ordinary
EM update,

$$\lambda^{(t+1)} = \frac{n}{\sum_{i=1}^{n} \left[ y_i \delta_i + c_i (1 - \delta_i) \right] + C / \lambda^{(t)}}. \tag{4.49}$$

Application of MCEM is also easy. In this case,

$$\hat{Q}^{(t+1)}(\lambda \mid \lambda^{(t)}) = n \log \lambda - \frac{\lambda}{m^{(t)}} \sum_{j=1}^{m^{(t)}} \mathbf{Y}_j^T \mathbf{1}, \tag{4.50}$$

where $\mathbf{1}$ is a vector of ones and $\mathbf{Y}_j$ is the jth completed dataset comprising the uncensored
data and simulated data $\mathbf{Z}_j = (Z_{j1}, \ldots, Z_{jC})$ with $Z_{jk} - c_k \sim$ i.i.d. Exp($\lambda^{(t)}$)
for $k = 1, \ldots, C$ to replace the censored values. Setting $\hat{Q}'(\lambda \mid \lambda^{(t)}) = 0$ and solving
for $\lambda$ yields

$$\lambda^{(t+1)} = \frac{n}{\sum_{j=1}^{m^{(t)}} \mathbf{Y}_j^T \mathbf{1} / m^{(t)}} \tag{4.51}$$

as the MCEM update.

The website for this book provides n = 30 observations, including C = 17
censored observations. Figure 4.2 compares the performance of MCEM and ordinary
EM for estimating $\lambda$ with these data. Both methods easily find the MLE $\hat\lambda = 0.2185$.
For MCEM, we used $m^{(t)} = 5^{1 + \lfloor t/10 \rfloor}$, where $\lfloor z \rfloor$ denotes the integer part of z. Fifty
iterations were used altogether. Both algorithms were initiated from $\lambda^{(0)} = 0.5042$,
which is the mean of all 30 data values disregarding censoring.
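
The two updates in this example can be run side by side. The sketch below is a hedged illustration: it does not use the book's dataset, which is not reproduced here, but simulated data of the same size, with hypothetical censoring levels and an arbitrary starting value. The Monte Carlo sample size follows the schedule $m^{(t)} = 5^{1 + \lfloor t/10 \rfloor}$ quoted above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the data (not the dataset from the book's website).
n = 30
y = rng.exponential(scale=1 / 0.2, size=n)      # complete data, true rate 0.2
c = rng.exponential(scale=6.0, size=n)          # hypothetical censoring levels
times, observed = np.minimum(y, c), y <= c
cens_levels = times[~observed]                  # c_k for the censored cases
C = cens_levels.size

lam_em = lam_mc = 1.0                           # arbitrary common starting value
for t in range(50):
    # Ordinary EM update, as in (4.49).
    lam_em = n / (times.sum() + C / lam_em)

    # MCEM: the Monte Carlo sample size grows as 5^(1 + floor(t/10)).
    m = 5 ** (1 + t // 10)
    # Impute the censored values m times: Z_jk = c_k + Exp(lam_mc).
    z = cens_levels + rng.exponential(scale=1 / lam_mc, size=(m, C))
    # Average over j of the completed-data totals Y_j^T 1.
    mean_total = times[observed].sum() + z.sum(axis=1).mean()
    lam_mc = n / mean_total                     # MCEM update, as in (4.51)

mle = observed.sum() / times.sum()              # closed-form MLE for comparison
```

The EM iterate settles at the closed-form MLE, while the MCEM iterate keeps fluctuating around it with a spread that shrinks as `m` grows, which is the behavior the text describes.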


4.3.2 Improving the M Step 


One of the appeals of the EM algorithm is that the derivation and maximization of
$Q(\theta \mid \theta^{(t)})$ is often simpler than incomplete-data maximum likelihood calculations,
since $Q(\theta \mid \theta^{(t)})$ relates to the complete-data likelihood. In some cases, however, the
M step cannot be carried out easily even though the E step yielding $Q(\theta \mid \theta^{(t)})$ is
straightforward. Several strategies have been proposed to facilitate the M step in such
cases.

FIGURE 4.2 Comparison of iterations for EM (solid) and MCEM (dotted) for the censored
exponential data discussed in Example 4.7.


4.3.2.1 ECM Algorithm Meng and Rubin’s ECM algorithm replaces the M step
with a series of computationally simpler conditional maximization (CM) steps [454].
Each conditional maximization is designed to be a simple optimization problem that
constrains $\theta$ to a particular subspace and permits either an analytical solution or a
very elementary numerical solution.

We call the collection of simpler CM steps after the tth E step a CM cycle.
Thus, the tth iteration of ECM is composed of the tth E step and the tth CM cycle.
Let S denote the total number of CM steps in each CM cycle. For $s = 1, \ldots, S$, the
sth CM step in the tth cycle requires the maximization of $Q(\theta \mid \theta^{(t)})$ subject to (or
conditional on) a constraint, say

$$g_s(\theta) = g_s\left(\theta^{(t + (s-1)/S)}\right), \tag{4.52}$$

where $\theta^{(t + (s-1)/S)}$ is the maximizer found in the (s − 1)th CM step of the current cycle.
When the entire cycle of S steps of CM has been completed, we set $\theta^{(t+1)} = \theta^{(t + S/S)}$
and proceed to the E step for the (t + 1)th iteration.

Clearly any ECM is a GEM algorithm (Section 4.2.1), since each CM step
increases Q. In order for ECM to be convergent, we need to ensure that each CM
cycle permits search in any direction for a maximizer of $Q(\theta \mid \theta^{(t)})$, so that ECM
effectively maximizes over the original parameter space for $\theta$ and not over some
subspace. Precise conditions are discussed in [452, 454]; extensions of this method
include [415, 456].

The art of constructing an effective ECM algorithm lies in choosing the constraints
cleverly. Usually, it is natural to partition $\theta$ into S subvectors, $\theta = (\theta_1, \ldots, \theta_S)$.
Then in the sth CM step, one might seek to maximize Q with respect to $\theta_s$ while holding
all other components of $\theta$ fixed. This amounts to the constraint induced by the
function $g_s(\theta) = (\theta_1, \ldots, \theta_{s-1}, \theta_{s+1}, \ldots, \theta_S)$. A maximization strategy of this type
has previously been termed iterated conditional modes [36]. If the conditional maximizations
are obtained by finding the roots of score functions, the CM cycle can also
be viewed as a Gauss-Seidel iteration (see Section 2.2.5).

Alternatively, the sth CM step might seek to maximize Q with respect to all
other components of $\theta$ while holding $\theta_s$ fixed. In this case, $g_s(\theta) = \theta_s$. Additional
systems of constraints can be imagined, depending on the particular problem context.
A variant of ECM inserts an E step between each pair of CM steps, thereby updating
Q at every stage of the CM cycle.


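Example 4.8 (Multivariate Regression with Missing Values) A particularly
illuminating example given by Meng and Rubin [454] involves multivariate regression
with missing values. Let $\mathbf{U}_1, \ldots, \mathbf{U}_n$ be n independent d-dimensional vectors observed
from the d-variate normal model given by

$$\mathbf{U}_i \sim N_d(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}) \tag{4.53}$$

for $\mathbf{U}_i = (U_{i1}, \ldots, U_{id})$ and $\boldsymbol{\mu}_i = \mathbf{V}_i \boldsymbol{\beta}$, where the $\mathbf{V}_i$ are known $d \times p$ design matrices,
$\boldsymbol{\beta}$ is a vector of p unknown parameters, and $\boldsymbol{\Sigma}$ is a $d \times d$ unknown variance-covariance
matrix. There are many cases where $\boldsymbol{\Sigma}$ has some meaningful structure, but
we consider $\boldsymbol{\Sigma}$ to be unstructured for simplicity. Suppose that some elements of some
$\mathbf{U}_i$ are missing.

Begin by reordering the elements of $\mathbf{U}_i$, $\boldsymbol{\mu}_i$, and the rows of $\mathbf{V}_i$ so that for each
i, the observed components of $\mathbf{U}_i$ are first and any missing components are last. For
each $\mathbf{U}_i$, denote by $\boldsymbol{\beta}_i$ and $\boldsymbol{\Sigma}_i$ the corresponding reorganizations of the parameters.
Thus, $\boldsymbol{\beta}_i$ and $\boldsymbol{\Sigma}_i$ are completely determined by $\boldsymbol{\beta}$, $\boldsymbol{\Sigma}$, and the pattern of missing data:
They do not represent an expansion of the parameter space.

This notational reorganization allows us to write $\mathbf{U}_i = (\mathbf{U}_{\text{obs},i}, \mathbf{U}_{\text{miss},i})$,
$\boldsymbol{\mu}_i = (\boldsymbol{\mu}_{\text{obs},i}, \boldsymbol{\mu}_{\text{miss},i})$, and

$$\boldsymbol{\Sigma}_i = \begin{pmatrix} \boldsymbol{\Sigma}_{\text{obs},i} & \boldsymbol{\Sigma}_{\text{cross},i} \\ \boldsymbol{\Sigma}_{\text{cross},i}^T & \boldsymbol{\Sigma}_{\text{miss},i} \end{pmatrix}. \tag{4.54}$$

The full set of observed data can be denoted $\mathbf{U}_{\text{obs}} = (\mathbf{U}_{\text{obs},1}, \ldots, \mathbf{U}_{\text{obs},n})$.

The observed-data log likelihood function is

$$l(\boldsymbol{\beta}, \boldsymbol{\Sigma} \mid \mathbf{u}_{\text{obs}}) = -\frac{1}{2} \sum_{i=1}^{n} \log |\boldsymbol{\Sigma}_{\text{obs},i}| - \frac{1}{2} \sum_{i=1}^{n} (\mathbf{u}_{\text{obs},i} - \boldsymbol{\mu}_{\text{obs},i})^T \boldsymbol{\Sigma}_{\text{obs},i}^{-1} (\mathbf{u}_{\text{obs},i} - \boldsymbol{\mu}_{\text{obs},i})$$

up to an additive constant. This likelihood is quite tedious to work with or to maximize.
Note, however, that the complete-data sufficient statistics are given by $\sum_{i=1}^{n} U_{ij}$ for
$j = 1, \ldots, d$ and $\sum_{i=1}^{n} U_{ij} U_{ik}$ for $j, k = 1, \ldots, d$. Thus, the E step amounts to finding
the expected values of these sufficient statistics conditional on the observed data and
current parameter values $\boldsymbol{\beta}^{(t)}$ and $\boldsymbol{\Sigma}^{(t)}$.

Now for $j = 1, \ldots, d$,

$$E\left\{ \sum_{i=1}^{n} U_{ij} \,\Big|\, \mathbf{u}_{\text{obs}}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)} \right\} = \sum_{i=1}^{n} a_{ij}^{(t)}, \tag{4.55}$$

where

$$a_{ij}^{(t)} = \begin{cases} \alpha_{ij}^{(t)} & \text{if } U_{ij} \text{ is missing,} \\ u_{ij} & \text{if } U_{ij} = u_{ij} \text{ is observed,} \end{cases} \tag{4.56}$$

and $\alpha_{ij}^{(t)} = E\{U_{ij} \mid \mathbf{u}_{\text{obs},i}, \boldsymbol{\beta}_i^{(t)}, \boldsymbol{\Sigma}_i^{(t)}\}$. Similarly, for $j, k = 1, \ldots, d$,

$$E\left\{ \sum_{i=1}^{n} U_{ij} U_{ik} \,\Big|\, \mathbf{u}_{\text{obs}}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)} \right\} = \sum_{i=1}^{n} \left( a_{ij}^{(t)} a_{ik}^{(t)} + b_{ijk}^{(t)} \right), \tag{4.57}$$

where

$$b_{ijk}^{(t)} = \begin{cases} \gamma_{ijk}^{(t)} & \text{if } U_{ij} \text{ and } U_{ik} \text{ are both missing,} \\ 0 & \text{otherwise,} \end{cases} \tag{4.58}$$

and $\gamma_{ijk}^{(t)} = \operatorname{cov}\{U_{ij}, U_{ik} \mid \mathbf{u}_{\text{obs},i}, \boldsymbol{\beta}_i^{(t)}, \boldsymbol{\Sigma}_i^{(t)}\}$.

Fortunately, the derivation of the $\alpha_{ij}^{(t)}$ and $\gamma_{ijk}^{(t)}$ is fairly straightforward. With the
blocks of $\boldsymbol{\Sigma}_i^{(t)}$ arranged as in (4.54), the conditional distribution of
$\mathbf{U}_{\text{miss},i} \mid (\mathbf{u}_{\text{obs},i}, \boldsymbol{\beta}_i^{(t)}, \boldsymbol{\Sigma}_i^{(t)})$ is

$$N\left( \boldsymbol{\mu}_{\text{miss},i}^{(t)} + \boldsymbol{\Sigma}_{\text{cross},i}^{(t)T} \left(\boldsymbol{\Sigma}_{\text{obs},i}^{(t)}\right)^{-1} \left(\mathbf{u}_{\text{obs},i} - \boldsymbol{\mu}_{\text{obs},i}^{(t)}\right),\; \boldsymbol{\Sigma}_{\text{miss},i}^{(t)} - \boldsymbol{\Sigma}_{\text{cross},i}^{(t)T} \left(\boldsymbol{\Sigma}_{\text{obs},i}^{(t)}\right)^{-1} \boldsymbol{\Sigma}_{\text{cross},i}^{(t)} \right).$$

The values for $\alpha_{ij}^{(t)}$ and $\gamma_{ijk}^{(t)}$ can be read from the mean vector and variance-covariance
matrix of this distribution, respectively. Knowing these, $Q(\boldsymbol{\beta}, \boldsymbol{\Sigma} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)})$ can be
formed following (4.26).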

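Having thus achieved the E step, we turn now to the M step. The high dimensionality
of the parameter space and the complexity of the observed-data likelihood
render difficult any direct implementation of the M step, whether by direct maximization
or by reference to the exponential family setup. However, implementing an
ECM strategy is straightforward using S = 2 conditional maximization steps in each
CM cycle.

Treating $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$ separately allows easy constrained optimizations of Q. First,
if we impose the constraint that $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{(t)}$, then we can maximize the constrained
version of $Q(\boldsymbol{\beta}, \boldsymbol{\Sigma} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)})$ with respect to $\boldsymbol{\beta}$ by using the weighted least squares
estimate

$$\boldsymbol{\beta}^{(t+1/2)} = \left( \sum_{i=1}^{n} \mathbf{V}_i^T \left(\boldsymbol{\Sigma}^{(t)}\right)^{-1} \mathbf{V}_i \right)^{-1} \left( \sum_{i=1}^{n} \mathbf{V}_i^T \left(\boldsymbol{\Sigma}^{(t)}\right)^{-1} \mathbf{a}_i^{(t)} \right), \tag{4.59}$$

where $\mathbf{a}_i^{(t)} = (a_{i1}^{(t)}, \ldots, a_{id}^{(t)})^T$ and $\boldsymbol{\Sigma}^{(t)}$ is treated as a known variance-covariance
matrix. This ensures that $Q(\boldsymbol{\beta}^{(t+1/2)}, \boldsymbol{\Sigma}^{(t)} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)}) \ge Q(\boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)})$. This
constitutes the first of two CM steps.

The second CM step follows from the fact that setting $\boldsymbol{\Sigma}^{(t+2/2)}$ equal to

$$\frac{1}{n} \sum_{i=1}^{n} E\left\{ \left(\mathbf{U}_i - \mathbf{V}_i \boldsymbol{\beta}^{(t+1/2)}\right) \left(\mathbf{U}_i - \mathbf{V}_i \boldsymbol{\beta}^{(t+1/2)}\right)^T \,\Big|\, \mathbf{u}_{\text{obs},i}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)} \right\} \tag{4.60}$$

maximizes $Q(\boldsymbol{\beta}, \boldsymbol{\Sigma} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)})$ with respect to $\boldsymbol{\Sigma}$ subject to the constraint that
$\boldsymbol{\beta} = \boldsymbol{\beta}^{(t+1/2)}$, because this amounts to plugging in $\alpha_{ij}^{(t)}$ and $\gamma_{ijk}^{(t)}$ values where necessary
and computing the sample covariance matrix of the completed data. This update
guarantees

$$Q(\boldsymbol{\beta}^{(t+1/2)}, \boldsymbol{\Sigma}^{(t+2/2)} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)}) \ge Q(\boldsymbol{\beta}^{(t+1/2)}, \boldsymbol{\Sigma}^{(t)} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)}) \ge Q(\boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\Sigma}^{(t)}). \tag{4.61}$$

Together, the two CM steps yield $(\boldsymbol{\beta}^{(t+1)}, \boldsymbol{\Sigma}^{(t+1)}) = (\boldsymbol{\beta}^{(t+1/2)}, \boldsymbol{\Sigma}^{(t+2/2)})$ and ensure
an increase in the Q function.

The E step and the CM cycle described here can each be implemented using
familiar closed-form analytic results; no numerical integration or maximization is
required. After updating the parameters with the CM cycle described above, we return
to another E step, and so forth. In summary, ECM alternates between (i) creating
updated complete datasets and (ii) sequentially estimating $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$ in turn by fixing
the other at its current value and using the current completed-data component.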
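
One possible rendering of this E step and two-step CM cycle is sketched below. It is a minimal illustration under several assumptions: missing entries are coded as `np.nan`, every case has at least one observed component, and the data, dimensions, and function names (`ecm_iteration`, `observed_loglik`) are invented for the example. Rather than physically reordering each $\mathbf{U}_i$, the code selects observed and missing blocks with boolean masks, which is equivalent.

```python
import numpy as np

def ecm_iteration(U, V, beta, Sigma):
    """One ECM iteration. U: (n, d) with np.nan for missing; V: (n, d, p)."""
    n, d = U.shape
    A = np.empty((n, d))             # completed responses a_i
    G = np.zeros((n, d, d))          # conditional covariances of missing parts
    for i in range(n):               # E step, one case at a time
        miss = np.isnan(U[i])
        obs = ~miss
        mu = V[i] @ beta
        A[i, obs] = U[i, obs]
        if miss.any():
            S_mo = Sigma[np.ix_(miss, obs)]
            K = S_mo @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
            A[i, miss] = mu[miss] + K @ (U[i, obs] - mu[obs])
            G[i][np.ix_(miss, miss)] = Sigma[np.ix_(miss, miss)] - K @ S_mo.T
    # CM step 1: weighted least squares for beta with Sigma held fixed.
    W = np.linalg.inv(Sigma)
    lhs = sum(V[i].T @ W @ V[i] for i in range(n))
    rhs = sum(V[i].T @ W @ A[i] for i in range(n))
    beta_new = np.linalg.solve(lhs, rhs)
    # CM step 2: expected residual cross-products with beta held at beta_new.
    R = A - np.einsum("idp,p->id", V, beta_new)
    Sigma_new = (R.T @ R + G.sum(axis=0)) / n
    return beta_new, Sigma_new

def observed_loglik(U, V, beta, Sigma):
    """Observed-data log likelihood, up to an additive constant."""
    total = 0.0
    for i in range(U.shape[0]):
        obs = ~np.isnan(U[i])
        r = U[i, obs] - (V[i] @ beta)[obs]
        S = Sigma[np.ix_(obs, obs)]
        total += -0.5 * (np.linalg.slogdet(S)[1] + r @ np.linalg.solve(S, r))
    return total

# Synthetic data (all values invented for illustration).
rng = np.random.default_rng(4)
n, d, p = 100, 3, 2
V = rng.normal(size=(n, d, p))
beta_true = np.array([1.0, -2.0])
L = np.array([[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.2, 0.3, 1.0]])
U = np.einsum("idp,p->id", V, beta_true) + rng.normal(size=(n, d)) @ L.T
mask = rng.random((n, d)) < 0.2
mask[:, 0] = False                   # keep one observed component per case
U[mask] = np.nan

beta, Sigma = np.zeros(p), np.eye(d)
logliks = []
for _ in range(30):
    logliks.append(observed_loglik(U, V, beta, Sigma))
    beta, Sigma = ecm_iteration(U, V, beta, Sigma)
```

Since ECM is a GEM algorithm, the recorded observed-data log likelihoods should never decrease from one iteration to the next, which makes `logliks` a simple diagnostic for coding errors.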


4.3.2.2 EM Gradient Algorithm If maximization cannot be accomplished 
analytically, then one might consider carrying out each M step using an iterative 
numerical optimization approach like those discussed in Chapter 2. This would yield 
an algorithm that had nested iterative loops. The ECM algorithm inserts S conditional 
maximization steps within each iteration of the EM algorithm, also yielding nested 
iteration. 

To avoid the computational burden of nested looping, Lange proposed replacing
the M step with a single step of Newton’s method, thereby approximating the
maximum without actually solving for it exactly [407]. The M step is replaced with
the update given by

$$\theta^{(t+1)} = \theta^{(t)} - Q''(\theta \mid \theta^{(t)})^{-1}\big|_{\theta = \theta^{(t)}}\; Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} \tag{4.62}$$

$$= \theta^{(t)} - Q''(\theta \mid \theta^{(t)})^{-1}\big|_{\theta = \theta^{(t)}}\; l'(\theta^{(t)} \mid \mathbf{x}), \tag{4.63}$$

where $l'(\theta^{(t)} \mid \mathbf{x})$ is the evaluation of the score function at the current iterate. Note that
(4.63) follows from the observation in Section 4.2.3.4 that $\theta^{(t)}$ maximizes
$Q(\theta \mid \theta^{(t)}) - l(\theta \mid \mathbf{x})$. This EM gradient algorithm has the same rate of convergence to $\hat\theta$ as the full
EM algorithm. Lange discusses conditions under which ascent can be ensured, and
scalings of the update increment to speed convergence [407]. In particular, when $\mathbf{Y}$
has an exponential family distribution with canonical parameter $\theta$, ascent is ensured
and the method matches that of Titterington [634]. In other cases, the step can be
scaled down to ensure ascent (as discussed in Section 2.2.2.1), but inflating steps
speeds convergence. For problems with a high proportion of missing information,
Lange suggests considering doubling the step length [407].

FIGURE 4.3 Steps taken by the EM gradient algorithm (long dashes). Ordinary EM steps
are shown with the solid line. Steps from two methods from later sections (Aitken and
quasi-Newton acceleration) are also shown, as indicated in the key. The observed-data log likelihood
is shown with the gray scale, with light shading corresponding to high likelihood. All algorithms
were started from $p_C = p_I = p_T = 1/3$.
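
As a small illustration of the update in (4.63), consider again right-censored exponential data, where $Q''(\lambda \mid \lambda^{(t)}) = -n/\lambda^2$ and the observed-data score is $U/\lambda - \sum_i x_i$. The sketch below assumes invented summary values and a hypothetical helper name, and omits step halving for brevity.

```python
def em_gradient_step(lam, n, n_uncensored, total_time):
    """One EM gradient update for right-censored exponential data."""
    score = n_uncensored / lam - total_time     # l'(lam | x) at the current iterate
    q_second = -n / lam**2                      # Q''(lam | lam_t) evaluated at lam_t
    return lam - score / q_second               # update (4.63)

# Hypothetical summaries: n cases, U uncensored, total observed time sum(x_i).
n, n_uncensored, total_time = 30, 18, 80.0
lam = 0.5
for _ in range(200):
    lam = em_gradient_step(lam, n, n_uncensored, total_time)
```

The iterates approach the MLE $U / \sum_i x_i$ at a linear rate, in line with the statement that the EM gradient algorithm converges at the same rate as ordinary EM. A production version would check for ascent and shrink the step when necessary.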


Example 4.9 (Peppered Moths, Continued) Continuing Example 4.2, we apply
the EM gradient algorithm to these data. It is straightforward to show

$$\frac{d^2 Q(\mathbf{p} \mid \mathbf{p}^{(t)})}{dp_C^2} = -\frac{2 n_{CC}^{(t)} + n_{CI}^{(t)} + n_{CT}^{(t)}}{p_C^2} - \frac{2 n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{(1 - p_C - p_I)^2}, \tag{4.64}$$

$$\frac{d^2 Q(\mathbf{p} \mid \mathbf{p}^{(t)})}{dp_I^2} = -\frac{2 n_{II}^{(t)} + n_{IT}^{(t)} + n_{CI}^{(t)}}{p_I^2} - \frac{2 n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{(1 - p_C - p_I)^2}, \tag{4.65}$$

and

$$\frac{d^2 Q(\mathbf{p} \mid \mathbf{p}^{(t)})}{dp_C\, dp_I} = -\frac{2 n_{TT}^{(t)} + n_{CT}^{(t)} + n_{IT}^{(t)}}{(1 - p_C - p_I)^2}. \tag{4.66}$$

Figure 4.3 shows the steps taken by the resulting EM gradient algorithm, starting
from $p_C = p_I = p_T = 1/3$. Step halving was implemented to ensure ascent. The first
step heads somewhat in the wrong direction, but in subsequent iterations the gradient
steps progress quite directly uphill. The ordinary EM steps are shown for comparison
in this figure.


4.3.3 Acceleration Methods 


The slow convergence of the EM algorithm is a notable drawback. Several techniques 
have been suggested for using the relatively simple analytic setup from EM to motivate 
particular forms for Newton-like steps. In addition to the two approaches described 
below, approaches that cleverly expand the parameter space in manners that speed 
convergence without affecting the marginal inference about $\theta$ are topics of recent
interest [421, 456]. 


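4.3.3.1 Aitken Acceleration Let $\theta_{EM}^{(t+1)}$ be the next iterate obtained by the standard
EM algorithm from $\theta^{(t)}$. Recall that the Newton update to maximize the log
likelihood would be

$$\theta^{(t+1)} = \theta^{(t)} - l''(\theta^{(t)} \mid \mathbf{x})^{-1}\, l'(\theta^{(t)} \mid \mathbf{x}). \tag{4.67}$$

The EM framework suggests a replacement for $l'(\theta^{(t)} \mid \mathbf{x})$. In Section 4.2.3.4 we noted
that $l'(\theta^{(t)} \mid \mathbf{x}) = Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}}$. Expanding $Q'$ around $\theta^{(t)}$, evaluated at $\theta_{EM}^{(t+1)}$, yields

$$Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta_{EM}^{(t+1)}} \approx Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} - \hat{i}_Y(\theta^{(t)}) \left( \theta_{EM}^{(t+1)} - \theta^{(t)} \right), \tag{4.68}$$

where $\hat{i}_Y(\theta^{(t)})$ is defined in (4.31). Since $\theta_{EM}^{(t+1)}$ maximizes $Q(\theta \mid \theta^{(t)})$ with respect to
$\theta$, the left-hand side of (4.68) equals zero. Therefore

$$Q'(\theta \mid \theta^{(t)})\big|_{\theta = \theta^{(t)}} \approx \hat{i}_Y(\theta^{(t)}) \left( \theta_{EM}^{(t+1)} - \theta^{(t)} \right). \tag{4.69}$$

Thus, from (4.67) we arrive at

$$\theta^{(t+1)} = \theta^{(t)} - l''(\theta^{(t)} \mid \mathbf{x})^{-1}\, \hat{i}_Y(\theta^{(t)}) \left( \theta_{EM}^{(t+1)} - \theta^{(t)} \right). \tag{4.70}$$

This update, which relies on the approximation in (4.69), is an example of a general
strategy known as Aitken acceleration and was proposed for EM by Louis [424].
Aitken acceleration of EM is precisely the same as applying the Newton-Raphson
method to find a zero of $\Psi(\theta) - \theta$, where $\Psi$ is the mapping defined by the ordinary
EM algorithm producing $\theta^{(t+1)} = \Psi(\theta^{(t)})$ [343].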
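
A scalar sketch of the accelerated update (4.70) follows, again assuming right-censored exponential data so that $l''(\lambda \mid \mathbf{x}) = -U/\lambda^2$ and $\hat{i}_Y(\lambda) = n/\lambda^2$ are available in closed form. The helper names, summary values, and iteration counts are hypothetical, and a few ordinary EM steps are taken first so that the approximation in (4.69) is reasonable.

```python
def em_map(lam, n, n_censored, total_time):
    """One ordinary EM cycle for right-censored exponential data."""
    return n / (total_time + n_censored / lam)

def aitken_step(lam, n, n_censored, total_time):
    """One Aitken-accelerated update, the scalar case of (4.70)."""
    n_uncensored = n - n_censored
    lam_em = em_map(lam, n, n_censored, total_time)
    l_second = -n_uncensored / lam**2           # l''(lam | x)
    i_complete = n / lam**2                     # complete information, as in (4.31)
    return lam - (i_complete / l_second) * (lam_em - lam)

# Hypothetical summaries: n cases, C censored, total observed time sum(x_i).
n, n_censored, total_time = 30, 12, 80.0
lam = 0.3
for _ in range(3):                              # preliminary ordinary EM steps
    lam = em_map(lam, n, n_censored, total_time)
for _ in range(6):                              # accelerated steps
    lam = aitken_step(lam, n, n_censored, total_time)
```

In this example each EM increment is inflated by the factor n/U, the ratio of complete to observed information, and a handful of accelerated steps reaches the MLE to machine precision.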


Example 4.10 (Peppered Moths, Continued) This acceleration approach can be
applied to Example 4.2. Obtaining $l''$ is analytically more tedious than the simpler
derivations employed for other EM approaches to this problem. Figure 4.3 shows the
Aitken accelerated steps, which converge quickly to the solution. The procedure was
started from $p_C = p_I = p_T = 1/3$, and step halving was used to ensure ascent.

Aitken acceleration is sometimes criticized for potential numerical instabilities
and convergence failures [153, 344]. Further, when $l''(\theta \mid \mathbf{x})$ is difficult to compute, this
approach cannot be applied without overcoming the difficulty [20, 345, 450].

Section 4.2.1 noted that the EM algorithm converges at a linear rate that depends
on the fraction of missing information. The updating increment given in (4.70)
is, loosely speaking, scaled by the ratio of the complete information to the observed
information. Thus, when a greater proportion of the information is missing, the nominal
EM steps are inflated more.

Newton’s method converges quadratically, but (4.69) only becomes a precise
approximation as $\theta^{(t)}$ nears $\hat\theta$. Therefore, we should expect this acceleration
approach to enhance convergence only after preliminary iterations have honed $\theta^{(t)}$ sufficiently.
The acceleration should not be employed without having taken some initial iterations
of ordinary EM so that (4.69) holds.


4.3.3.2 Quasi-Newton Acceleration The quasi-Newton optimization method 
discussed in Section 2.2.2.3 produces updates according to 

$$\theta^{(t+1)} = \theta^{(t)} - (M^{(t)})^{-1} l'(\theta^{(t)}|x) \tag{4.71}$$

for maximizing $l(\theta|x)$ with respect to $\theta$, where $M^{(t)}$ is an approximation to $l''(\theta^{(t)}|x)$. 
Within the EM framework, one can decompose $l''(\theta^{(t)}|x)$ into a part computed during 
EM and a remainder. By taking two derivatives of (4.19), we obtain 

$$l''(\theta^{(t)}|x) = Q''(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} - H''(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} \tag{4.72}$$

at iteration $t$. The remainder is the last term in (4.72); suppose we approximate it by 
$B^{(t)}$. Then by using 

$$M^{(t)} = Q''(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} - B^{(t)} \tag{4.73}$$

in (4.71) we obtain a quasi-Newton EM acceleration. 

A key feature of the approach is how $B^{(t)}$ approximates $H''(\theta^{(t)}|\theta^{(t)})$. The idea 
is to start with $B^{(0)} = 0$ and gradually accumulate information about $H''$ as iterations 
progress. The information is accumulated using a sequence of secant conditions, as 
is done in ordinary quasi-Newton approaches (Section 2.2.2.3). 

Specifically, we can require that $B^{(t)}$ satisfy the secant condition 

$$B^{(t+1)} a^{(t)} = b^{(t)}, \tag{4.74}$$

where 

$$a^{(t)} = \theta^{(t+1)} - \theta^{(t)} \tag{4.75}$$

and 

$$b^{(t)} = H'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t+1)}} - H'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t)}}. \tag{4.76}$$

Recalling the update (2.49), we can satisfy the secant condition by setting 

$$B^{(t+1)} = B^{(t)} + c^{(t)} v^{(t)} (v^{(t)})^T, \tag{4.77}$$

where $v^{(t)} = b^{(t)} - B^{(t)} a^{(t)}$ and $c^{(t)} = 1/[(v^{(t)})^T a^{(t)}]$. 

Lange proposed this quasi-Newton EM algorithm, along with several suggested 
strategies for improving its performance [408]. First, he suggested starting 
with $B^{(0)} = 0$. Note that this implies that the first increment will equal the EM 
gradient increment. Indeed, the EM gradient approach is exact Newton–Raphson for 
maximizing $Q(\theta|\theta^{(t)})$, whereas the approach described here evolves into approximate 
Newton–Raphson for maximizing $l(\theta|x)$. 

Second, Davidon's [134] update is troublesome if $(v^{(t)})^T a^{(t)} = 0$ or is small 
compared to $\|v^{(t)}\| \cdot \|a^{(t)}\|$. In such cases, we may simply set $B^{(t+1)} = B^{(t)}$. 

Third, there is no guarantee that $M^{(t)} = Q''(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} - B^{(t)}$ will be negative 
definite, which would ensure ascent at the $t$th step. Therefore, we may scale $B^{(t)}$ 
and use $M^{(t)} = Q''(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} - \alpha^{(t)} B^{(t)}$ where, for example, $\alpha^{(t)} = 2^{-m}$ for the 
smallest positive integer $m$ that makes $M^{(t)}$ negative definite. 
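
The update (4.77) and the two safeguards just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the threshold rel_tol, and the cap max_halvings are hypothetical choices, matrices are plain lists of rows, and the scaling search tries the unscaled choice $\alpha = 1$ before halving.

```python
def secant_update(B, a, b, rel_tol=1e-8):
    """Rank-one update (4.77): returns B_new satisfying B_new a = b.

    If v^T a is zero or tiny relative to ||v|| * ||a||, the update is
    skipped and a copy of B is returned (rel_tol is an assumed threshold).
    """
    p = len(a)
    v = [b[i] - sum(B[i][k] * a[k] for k in range(p)) for i in range(p)]
    denom = sum(v[i] * a[i] for i in range(p))
    norm = lambda u: sum(x * x for x in u) ** 0.5
    if abs(denom) <= rel_tol * norm(v) * norm(a):
        return [row[:] for row in B]
    return [[B[i][k] + v[i] * v[k] / denom for k in range(p)] for i in range(p)]


def is_negative_definite(M):
    """True if symmetric M is negative definite (Cholesky of -M succeeds)."""
    p = len(M)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for k in range(i + 1):
            s = -M[i][k] - sum(L[i][r] * L[k][r] for r in range(k))
            if i == k:
                if s <= 0.0:
                    return False
                L[i][i] = s ** 0.5
            else:
                L[i][k] = s / L[k][k]
    return True


def scaled_M(Q2, B, max_halvings=50):
    """Return (M, alpha) with M = Q2 - alpha * B negative definite.

    Tries alpha = 1, then 1/2, 1/4, ...; Q2 stands for Q'' evaluated at
    the current iterate and is assumed symmetric.
    """
    p = len(Q2)
    alpha = 1.0
    for _ in range(max_halvings + 1):
        M = [[Q2[i][k] - alpha * B[i][k] for k in range(p)] for i in range(p)]
        if is_negative_definite(M):
            return M, alpha
        alpha /= 2.0
    raise ValueError("no admissible scaling found")
```

Starting from $B^{(0)} = 0$, the first call to scaled_M simply returns $Q''$, which matches the remark that the first increment equals the EM gradient increment.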

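Finally, note that $b^{(t)}$ may be expressed entirely in terms of $Q'$ functions since 

\begin{align}
b^{(t)} &= H'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t+1)}} - H'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t)}} \tag{4.78} \\
&= 0 - H'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t)}} \tag{4.79} \\
&= Q'(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}} - Q'(\theta|\theta^{(t+1)})\big|_{\theta=\theta^{(t)}}. \tag{4.80}
\end{align}

Equation (4.79) follows from (4.19) and the fact that $l(\theta|x) - Q(\theta|\theta^{(t)})$ has its 
minimum at $\theta = \theta^{(t)}$. The derivative at this minimum must be zero, forcing 
$l'(\theta^{(t)}|x) = Q'(\theta|\theta^{(t)})\big|_{\theta=\theta^{(t)}}$, which allows (4.80). 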


Example 4.11 (Peppered Moths, Continued) We can apply quasi-Newton 
acceleration to Example 4.2, using the expressions for $Q''$ given in (4.64)-(4.66) 
and obtaining $b^{(t)}$ from (4.80). The procedure was started from 
$p_C^{(0)} = p_I^{(0)} = p_T^{(0)} = 1/3$ and $B^{(0)} = 0$, with step halving to ensure ascent. 

The results are shown in Figure 4.3. Note that $B^{(0)} = 0$ means that the first quasi- 
Newton EM step will match the first EM gradient step. The second quasi-Newton EM 
step completely overshoots the ridge of highest likelihood, resulting in a step that is 
just barely uphill. In general, the quasi-Newton EM procedure behaves like other 
quasi-Newton methods: There can be a tendency to step beyond the solution or to 
converge to a local minimum rather than a local maximum. With suitable safeguards, 
the procedure is fast and effective in this example. 


The quasi-Newton EM requires the inversion of $M^{(t)}$ at step $t$. Lange et al. 
describe a quasi-Newton approach based on the approximation of $-l''(\theta^{(t)}|x)$ by some 
$M^{(t)}$ that relies on an inverse-secant update [409, 410]. In addition to avoiding 
computationally burdensome matrix inversions, such updates to $\theta^{(t)}$ and $B^{(t)}$ can be expressed 
entirely in terms of $l'(\theta^{(t)}|x)$ and ordinary EM increments when the M step is solvable. 

TABLE 4.2 Frequencies of respondents reporting numbers of risky sexual encounters; 
see Problem 4.2. 

Encounters, i      0    1    2    3    4    5    6    7    8 
Frequency, n_i   379  299  222  145  109   95   73   59   45 

Encounters, i      9   10   11   12   13   14   15   16 
Frequency, n_i    30   24   12    4    2    0    1    1 


Jamshidian and Jennrich elaborate on inverse-secant updating and discuss the 
more complex BFGS approach [344]. These authors also provide a useful survey of 
a variety of EM acceleration approaches and a comparison of effectiveness. Some 
of their approaches converge faster on examples than does the approach described 
above. In a related paper, they present a conjugate gradient acceleration of EM [343]. 


PROBLEMS 


4.1. Recall the peppered moth analysis introduced in Example 4.2. In the field, it is quite 
difficult to distinguish the insularia and typica phenotypes due to variations in wing 
color and mottle. In addition to the 622 moths mentioned in the example, suppose the 
sample collected by the researchers actually included $n_U = 578$ more moths that were 
known to be insularia or typica but whose exact phenotypes could not be determined. 

a. Derive the EM algorithm for maximum likelihood estimation of $p_C$, $p_I$, and $p_T$ for 
this modified problem having observed data $n_C$, $n_I$, $n_T$, and $n_U$ as given above. 

b. Apply the algorithm to find the MLEs. 

c. Estimate the standard errors and pairwise correlations for $p_C$, $p_I$, and $p_T$ using the 
SEM algorithm. 

d. Estimate the standard errors and pairwise correlations for $p_C$, $p_I$, and $p_T$ by 
bootstrapping. 

e. Implement the EM gradient algorithm for these data. Experiment with step halving 
to ensure ascent and with other step scalings that may speed convergence. 

f. Implement Aitken accelerated EM for these data. Use step halving. 

g. Implement quasi-Newton EM for these data. Compare performance with and without 
step halving. 

h. Compare the effectiveness and efficiency of the standard EM algorithm and the three 
variants in (e), (f), and (g). Use step halving to ensure ascent with the three variants. 
Base your comparison on a variety of starting points. Create a graph analogous to 
Figure 4.3. 


4.2. Epidemiologists are interested in studying the sexual behavior of individuals at risk 
for HIV infection. Suppose 1500 gay men were surveyed and each was asked how 
many risky sexual encounters he had in the previous 30 days. Let $n_i$ denote the number 
of respondents reporting $i$ encounters, for $i = 0, \ldots, 16$. Table 4.2 summarizes the 
responses. 

These data are poorly fitted by a Poisson model. It is more realistic to assume 
that the respondents comprise three groups. First, there is a group of people who, 
for whatever reason, report zero risky encounters even if this is not true. Suppose a 
respondent has probability $\alpha$ of belonging to this group. 

With probability $\beta$, a respondent belongs to a second group representing typical 
behavior. Such people respond truthfully, and their numbers of risky encounters are 
assumed to follow a Poisson($\mu$) distribution. 

Finally, with probability $1 - \alpha - \beta$, a respondent belongs to a high-risk group. 
Such people respond truthfully, and their numbers of risky encounters are assumed to 
follow a Poisson($\lambda$) distribution. 

The parameters in the model are $\alpha$, $\beta$, $\mu$, and $\lambda$. At the $t$th iteration of EM, we 
use $\theta^{(t)} = (\alpha^{(t)}, \beta^{(t)}, \mu^{(t)}, \lambda^{(t)})$ to denote the current parameter values. The likelihood of 
the observed data is given by 


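$$L(\theta|n_0, \ldots, n_{16}) \propto \prod_{i=0}^{16} \left[\frac{\pi_i(\theta)}{i!}\right]^{n_i}, \tag{4.81}$$

where 

$$\pi_i(\theta) = \alpha 1_{\{i=0\}} + \beta \mu^i \exp\{-\mu\} + (1 - \alpha - \beta)\lambda^i \exp\{-\lambda\} \tag{4.82}$$

for $i = 0, \ldots, 16$. 

The observed data are $n_0, \ldots, n_{16}$. The complete data may be construed to be 
$n_{z,0}$, $n_{t,0}, \ldots, n_{t,16}$, and $n_{p,0}, \ldots, n_{p,16}$, where $n_{k,i}$ denotes the number of respondents in 
group $k$ reporting $i$ risky encounters and $k = z$, $t$, and $p$ correspond to the zero, typical, 
and promiscuous groups, respectively. Thus, $n_0 = n_{z,0} + n_{t,0} + n_{p,0}$ and $n_i = n_{t,i} + n_{p,i}$ 
for $i = 1, \ldots, 16$. Let $N = \sum_{i=0}^{16} n_i = 1500$. 

Define 

\begin{align}
z_0(\theta) &= \frac{\alpha}{\pi_0(\theta)}, \tag{4.83} \\
t_i(\theta) &= \frac{\beta \mu^i \exp\{-\mu\}}{\pi_i(\theta)}, \tag{4.84} \\
p_i(\theta) &= \frac{(1 - \alpha - \beta)\lambda^i \exp\{-\lambda\}}{\pi_i(\theta)} \tag{4.85}
\end{align}

for $i = 0, \ldots, 16$. These correspond to probabilities that respondents with $i$ risky 
encounters belong to the various groups. 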


a. Show that the EM algorithm provides the following updates: 

\begin{align}
\alpha^{(t+1)} &= \frac{n_0 z_0(\theta^{(t)})}{N}, \tag{4.86} \\
\beta^{(t+1)} &= \sum_{i=0}^{16} \frac{n_i t_i(\theta^{(t)})}{N}, \tag{4.87} \\
\mu^{(t+1)} &= \frac{\sum_{i=0}^{16} i\, n_i t_i(\theta^{(t)})}{\sum_{i=0}^{16} n_i t_i(\theta^{(t)})}, \tag{4.88} \\
\lambda^{(t+1)} &= \frac{\sum_{i=0}^{16} i\, n_i p_i(\theta^{(t)})}{\sum_{i=0}^{16} n_i p_i(\theta^{(t)})}. \tag{4.89}
\end{align}

b. Estimate the parameters of the model, using the observed data. 

c. Estimate the standard errors and pairwise correlations of your parameter estimates, 
using any available method. 


4.3. The website for this book contains 50 trivariate data points drawn from the $N_3(\mu, \Sigma)$ 
distribution. Some data points have missing values in one or more coordinates. Only 
27 of the 50 observations are complete. 

a. Derive the EM algorithm for joint maximum likelihood estimation of $\mu$ and $\Sigma$. It 
is easiest to recall that the multivariate normal density is in the exponential family. 

b. Determine the MLEs from a suitable starting point. Investigate the performance of 
the algorithm, and comment on your results. 

c. Consider Bayesian inference for $\mu$ when 

$$\Sigma = \begin{pmatrix} 1 & 0.6 & 1.2 \\ 0.6 & 0.5 & 0.5 \\ 1.2 & 0.5 & 3.0 \end{pmatrix}$$

is known. Assume independent priors for the three elements of $\mu$. Specifically, let 
the $j$th prior be 

$$f(\mu_j) = \frac{\exp\{-(\mu_j - \alpha_j)/\beta_j\}}{\beta_j \left[1 + \exp\{-(\mu_j - \alpha_j)/\beta_j\}\right]^2},$$

where $(\alpha_1, \alpha_2, \alpha_3) = (2, 4, 6)$ and $\beta_j = 2$ for $j = 1, 2, 3$. Comment on difficulties 
that would be faced in implementing a standard EM algorithm for estimating 
the posterior mode for $\mu$. Implement a gradient EM algorithm, and evaluate its 
performance. 

d. Suppose that $\Sigma$ is unknown in part (c) and that an improper uniform prior is adopted, 
that is, $f(\Sigma) \propto 1$ for all positive definite $\Sigma$. Discuss ideas for how to estimate the 
posterior mode for $\mu$ and $\Sigma$. 


4.4. Suppose we observe lifetimes for 14 gear couplings in certain mining equipment, as 
given in Table 4.3 (in years). Some of these data are right censored because the equipment 
was replaced before the gear coupling failed. The censored data are in parentheses; 
the actual lifetimes for these components may be viewed as missing. 

Model these data with the Weibull distribution, having density function 
$f(x) = abx^{b-1} \exp\{-ax^b\}$ for $x > 0$ and parameters $a$ and $b$. Recall that Problem 2.3 in 
Chapter 2 provides more details about such models. Construct an EM algorithm to 
estimate $a$ and $b$. Since the $Q$ function involves expectations that are analytically 
unavailable, adopt the MCEM strategy where necessary. Also, optimization of $Q$ cannot 
be completed analytically. Therefore, incorporate the ECM strategy of conditionally 
maximizing with respect to each parameter separately, applying a one-dimensional 
Newton-like optimizer where needed. Past observations suggest $(a, b) = (0.003, 2.5)$ 
may be a suitable starting point. Discuss the convergence properties of the procedure 
you develop, and the results you obtain. What are the advantages and disadvantages of 
your technique compared to direct maximization of the observed-data likelihood using, 
say, a two-dimensional quasi-Newton approach? 

TABLE 4.3 Fourteen lifetimes for mining equipment gear couplings, in years. 
Right-censored values are in parentheses. In these cases, we know only that the 
lifetime was at least as long as the given value. 

(6.94)    5.50    4.54    2.14   (3.65)   (3.40)   (4.38) 
10.24     4.56    9.42   (4.55)  (4.15)    5.64   (10.23) 


4.5. A hidden Markov model (HMM) can be used to describe the joint probability of a 
sequence of unobserved (hidden) discrete-state variables, $H = (H_0, \ldots, H_n)$, and a 
sequence of corresponding observed variables $O = (O_0, \ldots, O_n)$ for which $O_i$ is 
dependent on $H_i$ for each $i$. We say that $H_i$ emits $O_i$; consideration here is limited to 
discrete emission variables. Let the state spaces for elements of $H$ and $O$ be $\mathcal{H}$ and $\mathcal{E}$, 
respectively. 

Let $O_{\le j}$ and $O_{>j}$ denote the portions of $O$ with indices not exceeding $j$ and 
exceeding $j$, respectively, and define the analogous partial sequences for $H$. Under an 
HMM, the $H_i$ have the Markov property 

$$P[H_i | H_{\le i-1}, O_0] = P[H_i | H_{i-1}] \tag{4.90}$$

and the emissions are conditionally independent, so 

$$P[O_i | H, O_{\le i-1}, O_{>i}] = P[O_i | H_i]. \tag{4.91}$$

Time-homogeneous transitions of the hidden states are governed by transition 
probabilities $p(h, h^*) = P[H_{i+1} = h^* | H_i = h]$ for $h, h^* \in \mathcal{H}$. The distribution for $H_0$ is 
parameterized by $\pi(h) = P[H_0 = h]$ for $h \in \mathcal{H}$. Finally, define emission probabilities 
$e(h, o) = P[O_i = o | H_i = h]$ for $h \in \mathcal{H}$ and $o \in \mathcal{E}$. Then the parameter set $\theta = (\pi, P, E)$ 
completely parameterizes the model, where $\pi$ is a vector of initial-state probabilities, 
$P$ is a matrix of transition probabilities, and $E$ is a matrix of emission probabilities. 

For an observed sequence $o$, define the forward variables to be 

$$\alpha(i, h) = P[O_{\le i} = o_{\le i}, H_i = h] \tag{4.92}$$

and the backward variables to be 

$$\beta(i, h) = P[O_{>i} = o_{>i} | H_i = h] \tag{4.93}$$

for $i = 0, \ldots, n$ and each $h \in \mathcal{H}$. Our notation suppresses the dependence of the forward 
and backward variables on $\theta$. Note that 

$$P[O = o | \theta] = \sum_{h \in \mathcal{H}} \alpha(n, h) = \sum_{h \in \mathcal{H}} \pi(h) e(h, o_0) \beta(0, h). \tag{4.94}$$


The forward and backward variables are also useful for computing the probability that 
state $h$ occurred at the $i$th position of the sequence given $O = o$ according to 
$P[H_i = h | O = o, \theta] = \alpha(i, h)\beta(i, h) / P[O = o | \theta]$, and expectations of functions of the 
states with respect to these probabilities. 

a. Show that the following algorithms can be used to calculate $\alpha(i, h)$ and $\beta(i, h)$. The 
forward algorithm is 
• Initialize $\alpha(0, h) = \pi(h) e(h, o_0)$. 
• For $i = 0, \ldots, n-1$, let $\alpha(i+1, h) = \sum_{h^* \in \mathcal{H}} \alpha(i, h^*) p(h^*, h) e(h, o_{i+1})$. 
The backward algorithm is 
• Initialize $\beta(n, h) = 1$. 
• For $i = n, \ldots, 1$, let $\beta(i-1, h) = \sum_{h^* \in \mathcal{H}} p(h, h^*) e(h^*, o_i) \beta(i, h^*)$. 
These algorithms provide very efficient methods for finding $P[O = o | \theta]$ and other 
useful probabilities, compared to naively summing over all possible sequences of 
states. 
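
The two recursions in part (a) translate directly into code. The sketch below is illustrative only: the function name and the list-of-lists representation are assumptions, states and emitted symbols are coded as integers starting at 0, and no rescaling is done, so the values can underflow for long sequences.

```python
def forward_backward(pi, P, E, obs):
    """Forward and backward variables for a discrete HMM.

    pi[h]   : initial-state probability pi(h)
    P[h][g] : transition probability p(h, g)
    E[h][o] : emission probability e(h, o)
    obs     : observed symbols o_0, ..., o_n coded as integers
    Returns (alpha, beta), each indexed as [i][h].
    """
    S, T = len(pi), len(obs)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]      # beta(n, h) = 1
    for h in range(S):
        alpha[0][h] = pi[h] * E[h][obs[0]]
    for i in range(T - 1):
        for h in range(S):
            # alpha(i+1, h) = sum_g alpha(i, g) p(g, h) e(h, o_{i+1})
            alpha[i + 1][h] = (
                sum(alpha[i][g] * P[g][h] for g in range(S)) * E[h][obs[i + 1]]
            )
    for i in range(T - 1, 0, -1):
        for h in range(S):
            # beta(i-1, h) = sum_g p(h, g) e(g, o_i) beta(i, g)
            beta[i - 1][h] = sum(
                P[h][g] * E[g][obs[i]] * beta[i][g] for g in range(S)
            )
    return alpha, beta


def sequence_probability(alpha):
    """P[O = o | theta] as the sum of alpha(n, h) over h, as in (4.94)."""
    return sum(alpha[-1])
```

For a two-state chain, pi, P, and E are a length-2 list, a 2 x 2 list of rows, and a 2 x K list of rows. The cost grows linearly in the sequence length, whereas direct enumeration sums over every possible state sequence.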


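b. Let $N(h)$ denote the number of times $H_0 = h$, let $N(h, h^*)$ denote the number of 
transitions from $h$ to $h^*$, and let $N(h, o)$ denote the number of emissions of $o$ when 
the underlying state is $h$. Prove that these random variables have the following 
expectations: 

\begin{align}
E\{N(h)\} &= \frac{\alpha(0, h)\beta(0, h)}{P[O = o | \theta]}, \tag{4.95} \\
E\{N(h, h^*)\} &= \sum_{i=0}^{n-1} \frac{\alpha(i, h) p(h, h^*) e(h^*, o_{i+1}) \beta(i+1, h^*)}{P[O = o | \theta]}, \tag{4.96} \\
E\{N(h, o)\} &= \sum_{i:\, O_i = o} \frac{\alpha(i, h)\beta(i, h)}{P[O = o | \theta]}. \tag{4.97}
\end{align}

c. The Baum–Welch algorithm efficiently estimates the parameters of an HMM [25]. 
Fitting these models has proven extremely useful in diverse applications including 
statistical genetics, signal processing and speech recognition, problems involving 
environmental time series, and Bayesian graphical networks [172, 236, 361, 392, 
523]. Starting from some initial values $\theta^{(0)}$, the Baum–Welch algorithm proceeds 
via iterative application of the following update formulas: 

\begin{align}
\pi(h)^{(t+1)} &= \frac{E\{N(h) | \theta^{(t)}\}}{\sum_{h^* \in \mathcal{H}} E\{N(h^*) | \theta^{(t)}\}}, \tag{4.98} \\
p(h, h^*)^{(t+1)} &= \frac{E\{N(h, h^*) | \theta^{(t)}\}}{\sum_{h^{**} \in \mathcal{H}} E\{N(h, h^{**}) | \theta^{(t)}\}}, \tag{4.99} \\
e(h, o)^{(t+1)} &= \frac{E\{N(h, o) | \theta^{(t)}\}}{\sum_{o^* \in \mathcal{E}} E\{N(h, o^*) | \theta^{(t)}\}}. \tag{4.100}
\end{align}

Prove that the Baum–Welch algorithm is an EM algorithm. It is useful to begin by 
noting that the complete data likelihood is given by 

$$\prod_{h \in \mathcal{H}} \pi(h)^{N(h)} \prod_{h \in \mathcal{H}} \prod_{o \in \mathcal{E}} e(h, o)^{N(h, o)} \prod_{h \in \mathcal{H}} \prod_{h^* \in \mathcal{H}} p(h, h^*)^{N(h, h^*)}. \tag{4.101}$$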

d. Consider the following scenario. In Flip’s left pocket is a penny; in his right pocket 
is a dime. On a fair toss, the probability of showing a head is p for the penny 
and d for the dime. Flip randomly chooses a coin to begin, tosses it, and reports the 
outcome (heads or tails) without revealing which coin was tossed. Then, Flip decides 
whether to use the same coin for the next toss, or to switch to the other coin. He 
switches coins with probability s, and retains the same coin with probability 1 — s. 
The outcome of the second toss is reported, again not revealing the coin used. This 
process is continued for a total of 200 coin tosses. The resulting sequence of heads 
and tails is available from the website for this book. Use the Baum—Welch algorithm 
to estimate p, d, and s. 


e. Only for students seeking extra challenge: Derive the Baum—Welch algorithm for 
the case when the dataset consists of M independent observation sequences arising 
from an HMM. Simulate such data, following the coin example above. (You may wish 
to mimic the single-sequence data, which were simulated using p = 0.25,d = 0.85, 
and s = 0.1.) Code the Baum—Welch algorithm, and test it on your simulated data. 


In addition to considering multiple sequences, HMMs and the Baum—Welch algorithm 
can be generalized for estimation based on more general emission variables and emis- 
sion and transition probabilities that have more complex parameterizations, including 
time inhomogeneity. 




INTEGRATION AND 
SIMULATION 


Statisticians attempt to infer what is and what could be. To do this, we 
often rely on what is expected. In statistical contexts, expectations are usually 
expressed as integrals with respect to probability distributions. 

The value of an integral can be derived analytically or numerically. Since 
an analytic solution is usually impossible for all but the simplest statistical 
problems, a numerical approximation is often used. 

Numerical quadrature approximates the integral by systematically parti- 
tioning the region of integration into smaller parts, applying a simple approx- 
imation for each part, and then combining the results. We begin this portion 
of the book with coverage of quadrature techniques. 

The Monte Carlo method attacks the problem by simulating random 
realizations and then averaging these to approximate the theoretical average. 
We describe these methods and explore a variety of strategies for improving 
their performance. Markov chain Monte Carlo is a particularly important 
simulation technique, and we devote two chapters to such methods. 

Although we initially pose Monte Carlo methods as integration tools, 
it becomes increasingly apparent in these chapters that such probabilistic 
methods have broad utility for simulating random variates irrespective of how 
those simulations will be used. Our examples and exercises illustrate a variety 
of simulation applications. 

CHAPTER 5 


NUMERICAL INTEGRATION 


Consider a one-dimensional integral of the form $\int_a^b f(x)\,dx$. The value of the integral 
can be derived analytically for only a few functions $f$. For the rest, numerical 
approximations of the integral are often useful. Approximation methods are well known by 
both numerical analysts [139, 353, 376, 516] and statisticians [409, 630]. 

Approximation of integrals is frequently required for Bayesian inference since 
a posterior distribution may not belong to a familiar distributional family. Integral 
approximation is also useful in some maximum likelihood inference problems when 
the likelihood itself is a function of one or more integrals. An example of this occurs 
when fitting generalized linear mixed models, as discussed in Example 5.1. 

To initiate an approximation of $\int_a^b f(x)\,dx$, partition the interval $[a, b]$ into 
$n$ subintervals, $[x_i, x_{i+1}]$ for $i = 0, \ldots, n-1$, with $x_0 = a$ and $x_n = b$. Then 
$\int_a^b f(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} f(x)\,dx$. This composite rule breaks the whole integral into 
many smaller parts, but postpones the question of how to approximate any single part. 

The approximation of a single part will be made using a simple rule. Within the 
interval $[x_i, x_{i+1}]$, insert $m + 1$ nodes, $x_{ij}^*$ for $j = 0, \ldots, m$. Figure 5.1 illustrates the 
relationships among the interval $[a, b]$, the subintervals, and the nodes. In general, 
numerical integration methods require neither equal spacing of subintervals or nodes 
nor equal numbers of nodes within each subinterval. 

A simple rule will rely upon the approximation 

$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \sum_{j=0}^{m} A_{ij} f(x_{ij}^*) \tag{5.1}$$

for some set of constants $A_{ij}$. The overall integral is then approximated by the 
composite rule that sums (5.1) over all subintervals. 


5.1 NEWTON-CÔTES QUADRATURE 


A simple and flexible class of integration methods consists of the Newton–Côtes 
rules. In this case, the nodes are equally spaced in $[x_i, x_{i+1}]$, and the same number 
of nodes is used in each subinterval. The Newton–Côtes approach replaces the true 
integrand with a polynomial approximation on each subinterval. The constants $A_{ij}$ are 
selected so that $\sum_{j=0}^{m} A_{ij} f(x_{ij}^*)$ equals the integral of an interpolating polynomial 
on $[x_i, x_{i+1}]$ that matches the value of $f$ at the nodes within this subinterval. The 
remainder of this section reviews common Newton–Côtes rules. 

FIGURE 5.1 To integrate $f$ between $a$ and $b$, the interval is partitioned into $n$ subintervals, 
$[x_i, x_{i+1}]$, each of which is further partitioned using $m + 1$ nodes, $x_{i0}^*, \ldots, x_{im}^*$. Note that when 
$m = 0$, the subinterval $[x_i, x_{i+1}]$ contains only one interior node, $x_{i0}^* = x_i$. 


5.1.1 Riemann Rule 


Consider the case when $m = 0$. Suppose we define $x_{i0}^* = x_i$, and $A_{i0} = x_{i+1} - x_i$. 
The simple Riemann rule amounts to approximating $f$ on each subinterval by a 
constant function, $f(x_i)$, whose value matches that of $f$ at one point on the interval. 
In other words, 

$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \int_{x_i}^{x_{i+1}} f(x_i)\,dx = (x_{i+1} - x_i) f(x_i). \tag{5.2}$$

The composite rule sums $n$ such terms to provide an approximation to the integral 
over $[a, b]$. 

Suppose the $x_i$ are equally spaced so that each subinterval has the same length 
$h = (b - a)/n$. Then we may write $x_i = a + ih$, and the composite rule is 

$$\int_a^b f(x)\,dx \approx h \sum_{i=0}^{n-1} f(a + ih) = \widehat{R}(n). \tag{5.3}$$

As Figure 5.2 shows, this corresponds to the Riemann integral studied in 
elementary calculus. Furthermore, there is nothing special about the left endpoints of 
the subintervals: We easily could have replaced $f(x_i)$ with $f(x_{i+1})$ in (5.2). 

FIGURE 5.2 Approximation (dashed) to $f$ (solid) provided on the subinterval $[x_i, x_{i+1}]$, for 
the Riemann ($m = 0$), trapezoidal ($m = 1$), and Simpson's ($m = 2$) rules. 


The approximation given by (5.3) converges to the true value of the integral 
as $n \to \infty$, by definition of the Riemann integral of an integrable function. If $f$ is 
a polynomial of zero degree (i.e., a constant function), then $f$ is constant on each 
subinterval, so the Riemann rule is exact. 

When applying the composite Riemann rule, it makes sense to calculate a 
sequence of approximations, say $\widehat{R}(n_k)$, for an increasing sequence of numbers of 
subintervals, $n_k$, as $k = 1, 2, \ldots$. Then, convergence of $\widehat{R}(n_k)$ can be monitored using 
an absolute or relative convergence criterion as discussed in Chapter 2. It is particularly 
efficient to use $n_{k+1} = 2n_k$ so that half the subinterval endpoints at the next step 
correspond to the old endpoints from the previous step. This avoids calculations of $f$ 
that are effectively redundant. 
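
The composite rule (5.3) and the doubling strategy can be sketched as follows. The function names are illustrative assumptions; the second function relies on the identity $\widehat{R}(2n) = \widehat{R}(n)/2 + (h/2)\sum_i f(a + (i + 1/2)h)$ with $h = (b - a)/n$, which follows by splitting the sum in (5.3) into old and new endpoints.

```python
def riemann(f, a, b, n):
    """Composite left-endpoint Riemann rule (5.3) with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + i * h) for i in range(n))


def riemann_doubled(f, a, b, n, prev):
    """Estimate for 2n subintervals, given prev, the estimate for n.

    Only the n new endpoints (the midpoints of the old subintervals)
    are evaluated; the old evaluations are reused through prev.
    """
    h = (b - a) / n
    new_points = sum(f(a + (i + 0.5) * h) for i in range(n))
    return 0.5 * prev + 0.5 * h * new_points
```

Each doubling step therefore costs $n$ new evaluations of $f$ rather than $2n$.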


Example 5.1 (Alzheimer’s Disease) Data from 22 patients with Alzheimer’s dis- 
ease, an ailment characterized by progressive mental deterioration, are shown in 
Table 5.1. In each of five consecutive months, patients were asked to recall words 
from a standard list given previously. The number of words recalled by each patient 
was recorded. The patients in Table 5.1 were receiving an experimental treatment 
with lecithin, a dietary supplement. It is of interest to investigate whether memory 
improved over time. The data for these patients (and 25 control cases) are available 
from the website for this book and are discussed further in [155]. 

Consider fitting these data with a very simple generalized linear mixed model 
[69, 670]. Let $Y_{ij}$ denote the number of words recalled by the $i$th individual in the 
$j$th month, for $i = 1, \ldots, 22$ and $j = 1, \ldots, 5$. Suppose $Y_{ij} | \lambda_{ij}$ have independent 
Poisson($\lambda_{ij}$) distributions, where the mean and variance of $Y_{ij}$ are $\lambda_{ij}$. Let $x_{ij} = (1 \;\; j)^T$ 
be a covariate vector: Aside from an intercept term, only the month is used as a 
predictor. Let $\beta = (\beta_0 \;\; \beta_1)^T$ be a parameter vector corresponding to $x$. Then we model 
the mean of $Y_{ij}$ as 

$$\lambda_{ij} = \exp\{x_{ij}^T \beta + \gamma_i\}, \tag{5.4}$$

where the $\gamma_i$ are independent $N(0, \sigma_\gamma^2)$ random effects. This model allows separate 
shifts in $\lambda_{ij}$ on the log scale for each patient, reflecting the assumption that there may 
be substantial between-patient variation in counts. This is reasonable, for example, if 
the baseline conditions of patients prior to the start of treatment varied. 

TABLE 5.1 Words recalled on five consecutive monthly tests for 22 Alzheimer's patients receiving 
lecithin. 

                        Patient 
Month    1   2   3   4   5   6   7   8   9  10  11 
  1      9   6  13   9   6  11   7   8   3   4  11 
  2     12   7  18  10   7  11  10  18   3  10  10 
  3     16  10  14  12   8  12  11  19   3  11  10 
  4     17  15  21  14   9  14  12  19   7  17  15 
  5     18  16  21  15  12  16  14  22   8  18  16 

                        Patient 
Month   12  13  14  15  16  17  18  19  20  21  22 
  1      1   6   0  18  15  10   6   9   4   4  10 
  2      3   7   3  18  15  14   6   9   3  13  11 
  3      2   7   3  19  15  16   7  13   4  13  13 
  4      4   9   4  22  18  17   9  16   7  16  17 
  5      5  10   6  22  19  19  10  20   9  19  21 

With this model, the likelihood is 

$$L(\beta, \sigma_\gamma^2 | y) = \prod_{i=1}^{22} \int \phi(\gamma_i; 0, \sigma_\gamma^2) \left[\prod_{j=1}^{5} f(y_{ij} | \lambda_{ij})\right] d\gamma_i = \prod_{i=1}^{22} L_i(\beta, \sigma_\gamma^2 | y), \tag{5.5}$$

where $f(y_{ij} | \lambda_{ij})$ is the Poisson density, $\phi(\gamma_i; 0, \sigma_\gamma^2)$ is the normal density function 
with mean zero and variance $\sigma_\gamma^2$, and $y$ is a vector of all the observed response values. 
The log likelihood is therefore 

$$l(\beta, \sigma_\gamma^2 | y) = \sum_{i=1}^{22} l_i(\beta, \sigma_\gamma^2 | y), \tag{5.6}$$

where $l_i$ denotes the contribution to the log likelihood made by data from the $i$th 
patient. 

FIGURE 5.3 Example 5.1 seeks to integrate this function, which arises from a generalized 
linear mixed model for data on patients treated for Alzheimer's disease. (Horizontal axis: $\gamma_1$, 
from $-0.05$ to $0.05$; vertical axis: integrand of (5.7) in units of $10^{-4}$.) 


To maximize the log likelihood, we must differentiate $l$ with respect to each 
parameter and solve the corresponding score equations. This will require a numerical 
root-finding procedure, since the solution cannot be determined analytically. In this 
example, we look only at one small portion of this overall process: the evaluation of 
$dl_i/d\beta_k$ for particular given values of the parameters and for a single $i$ and $k$. This 
evaluation would be repeated for the parameter values tried at each iteration of a 
root-finding procedure. 

Let $i = 1$ and $k = 1$. The partial derivative with respect to the parameter for 
monthly rate of change is $dl_1/d\beta_1 = (dL_1/d\beta_1)/L_1$, where $L_1$ is implicitly defined 
in (5.5). Further, 

\begin{align}
\frac{dL_1}{d\beta_1} &= \frac{d}{d\beta_1} \int \phi(\gamma_1; 0, \sigma_\gamma^2) \prod_{j=1}^{5} f(y_{1j} | \lambda_{1j})\, d\gamma_1 \notag \\
&= \int \phi(\gamma_1; 0, \sigma_\gamma^2) \left[\frac{d}{d\beta_1} \prod_{j=1}^{5} f(y_{1j} | \lambda_{1j})\right] d\gamma_1 \notag \\
&= \int \phi(\gamma_1; 0, \sigma_\gamma^2) \left[\sum_{j=1}^{5} j (y_{1j} - \lambda_{1j})\right] \prod_{j=1}^{5} f(y_{1j} | \lambda_{1j})\, d\gamma_1, \tag{5.7}
\end{align}

where $\lambda_{1j} = \exp\{\beta_0 + j\beta_1 + \gamma_1\}$. The last equality in (5.7) follows from standard 
analysis of generalized linear models [446]. 

Suppose, at the very first step of optimization, we start with initial values 
$\beta = (1.804, 0.165)$ and $\sigma_\gamma^2 = 0.015^2$. These starting values were chosen using simple 
exploratory analyses. Using these values for $\beta$ and $\sigma_\gamma$, the integral we seek in (5.7) 
has the integrand shown in Figure 5.3. The desired range of integration is the entire 
real line, whereas we have thus far only discussed integration over a closed interval. 
Transformations can be used to obtain an equivalent integral over a finite range 
(see Section 5.4.1), but to keep things simple here we will integrate over the range 
$[-0.07, 0.085]$, within which nearly all of the nonnegligible values of the integrand lie. 

TABLE 5.2 Estimates of the integral in (5.7) using the Riemann rule with various 
numbers of subintervals. All estimates are multiplied by a factor of $10^5$. Errors 
for use in a relative convergence criterion are given in the final column. 

Subintervals    Estimate             Relative Error 
       2        3.49388458186769 
       4        1.88761005959780     -0.46 
       8        1.72890354401971     -0.084 
      16        1.72889046749119     -0.0000076 
      32        1.72889038608621     -0.000000047 
      64        1.72889026784032     -0.000000068 
     128        1.72889018400995     -0.000000048 
     256        1.72889013551548     -0.000000028 
     512        1.72889010959701     -0.000000015 
    1024        1.72889009621830     -0.0000000077 

Table 5.2 shows the results of a series of Riemann approximations, along with 
running relative errors. The relative errors measure the change in the new estimated 
value of the integral as a proportion of the previous estimate. An iterative approx- 
imation strategy could be stopped when the magnitude of these errors falls below 
some predetermined tolerance threshold. Since the integral is quite small, a relative 
convergence criterion is more intuitive than an absolute criterion. 
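
A stopping rule of this kind can be sketched for the integrand of (5.7). The code below is an illustration rather than the computation behind Table 5.2: it assumes the counts for patient 1 as read from Table 5.1, the starting values quoted above, and a tolerance tol chosen purely for demonstration; the function names are hypothetical.

```python
import math

# Counts for patient 1 in months 1..5 (Table 5.1) and the starting
# values beta = (1.804, 0.165), sigma_gamma = 0.015 from Example 5.1.
Y1 = [9, 12, 16, 17, 18]
BETA0, BETA1, SIGMA = 1.804, 0.165, 0.015


def integrand(gamma):
    """Integrand of (5.7) for patient 1 at random-effect value gamma."""
    log_phi = -0.5 * (gamma / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))
    score = 0.0
    log_pois = 0.0
    for j, y in enumerate(Y1, start=1):
        lam = math.exp(BETA0 + j * BETA1 + gamma)
        score += j * (y - lam)
        log_pois += y * math.log(lam) - lam - math.lgamma(y + 1)
    return score * math.exp(log_phi + log_pois)


def riemann(f, a, b, n):
    """Composite left-endpoint Riemann rule (5.3)."""
    h = (b - a) / n
    return h * sum(f(a + i * h) for i in range(n))


def integrate_until_converged(f, a, b, tol=1e-6, n=2, max_n=2 ** 20):
    """Double n until the relative change in the estimate drops below tol."""
    prev = riemann(f, a, b, n)
    while n < max_n:
        n *= 2
        cur = riemann(f, a, b, n)
        if abs((cur - prev) / prev) < tol:
            return cur, n
        prev = cur
    raise RuntimeError("tolerance not reached")


estimate, n_used = integrate_until_converged(integrand, -0.07, 0.085)
print(n_used, estimate * 1e5)
```

If the data and starting values above are transcribed correctly, the printed estimate should agree closely with the later rows of Table 5.2. For simplicity this sketch recomputes every function value at each doubling instead of reusing the old endpoints.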


5.1.2 Trapezoidal Rule 


Although the simple Riemann rule is exact if f is constant on [a, b], it can be quite slow 
to converge to adequate precision in general. An obvious improvement would be to 
replace the piecewise constant approximation by a piecewise mth-degree polynomial 
approximation. We begin by introducing a class of polynomials that can be used for 
such approximations. This permits the Riemann rule to be cast as the simplest member 
of a family of integration rules having increased precision as m increases. This family 
also includes the trapezoidal rule and Simpson’s rule (Section 5.1.3). 
Let the fundamental polynomials be 

$$p_{ij}(x) = \prod_{k=0,\, k \ne j}^{m} \frac{x - x_{ik}^*}{x_{ij}^* - x_{ik}^*} \tag{5.8}$$

for $j = 0, \ldots, m$. Then the function $p_i(x) = \sum_{j=0}^{m} f(x_{ij}^*) p_{ij}(x)$ is an $m$th-degree 
polynomial that interpolates $f$ at all the nodes $x_{i0}^*, \ldots, x_{im}^*$ in $[x_i, x_{i+1}]$. Figure 5.2 
shows such interpolating polynomials for $m = 0$, 1, and 2. 

These polynomials are the basis for the simple approximation 

\begin{align}
\int_{x_i}^{x_{i+1}} f(x)\,dx &\approx \int_{x_i}^{x_{i+1}} p_i(x)\,dx \tag{5.9} \\
&= \sum_{j=0}^{m} f(x_{ij}^*) \int_{x_i}^{x_{i+1}} p_{ij}(x)\,dx \tag{5.10} \\
&= \sum_{j=0}^{m} A_{ij} f(x_{ij}^*) \tag{5.11}
\end{align}

for $A_{ij} = \int_{x_i}^{x_{i+1}} p_{ij}(x)\,dx$. This approximation replaces integration of an arbitrary 
function $f$ with polynomial integration. The resulting composite rule is 
$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} \sum_{j=0}^{m} A_{ij} f(x_{ij}^*)$ when there are $m$ nodes on each subinterval. 
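
As an informal check on this construction, the constants $A_{ij}$ can be computed numerically by expanding each fundamental polynomial in (5.8) and integrating it term by term. The function name and argument layout below are illustrative assumptions; polynomial coefficients are stored in ascending powers of $x$.

```python
def newton_cotes_weights(nodes, lo, hi):
    """Integrals over [lo, hi] of the fundamental polynomials for nodes.

    Returns the list of A_ij, one per node, for a single subinterval.
    """
    weights = []
    for j, xj in enumerate(nodes):
        coeffs = [1.0]                      # the constant polynomial 1
        for k, xk in enumerate(nodes):
            if k == j:
                continue
            scale = 1.0 / (xj - xk)
            # multiply the current polynomial by (x - xk) / (xj - xk)
            nxt = [0.0] * (len(coeffs) + 1)
            for d, c in enumerate(coeffs):
                nxt[d] += -xk * scale * c
                nxt[d + 1] += scale * c
            coeffs = nxt
        # integrate each monomial c * x^d over [lo, hi]
        weights.append(
            sum(c * (hi ** (d + 1) - lo ** (d + 1)) / (d + 1)
                for d, c in enumerate(coeffs))
        )
    return weights
```

With a single node this returns the subinterval length, as in the Riemann rule; with the two endpoints as nodes it returns two equal weights of half the subinterval length.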


Letting $m = 1$ with $x_{i0}^* = x_i$ and $x_{i1}^* = x_{i+1}$ yields the trapezoidal rule. In this 
case, 

$$p_{i0}(x) = \frac{x - x_{i+1}}{x_i - x_{i+1}} \quad \text{and} \quad p_{i1}(x) = \frac{x - x_i}{x_{i+1} - x_i}.$$

Integrating these polynomials yields $A_{i0} = A_{i1} = (x_{i+1} - x_i)/2$. Therefore, the 
trapezoidal rule amounts to 

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} \left(\frac{x_{i+1} - x_i}{2}\right) \left(f(x_i) + f(x_{i+1})\right). \tag{5.12}$$

When $[a, b]$ is partitioned into $n$ subintervals of equal length $h = (b - a)/n$, then the 
trapezoidal rule estimate is 

$$\int_a^b f(x)\,dx \approx \frac{h}{2} f(a) + h \sum_{i=1}^{n-1} f(a + ih) + \frac{h}{2} f(b) = \widehat{T}(n). \tag{5.13}$$

The name of this approximation arises because the area under f in each subin- 
terval is approximated by the area of a trapezoid, as shown in Figure 5.2. Note that 
f is approximated in any subinterval by a first-degree polynomial (i.e., a line) whose 
value equals that of f at two points. Therefore, when f itself is a line on [a, b], T(n) 
is exact. 
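
A minimal sketch of the equal-spacing estimate (5.13) follows; the function name is an illustrative assumption.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.13) with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return 0.5 * h * f(a) + h * interior + 0.5 * h * f(b)
```

Applied to a linear function, a single subinterval already reproduces the integral, in line with the exactness property noted above.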


Example 5.2 (Alzheimer's Disease, Continued) For small numbers of subintervals, 
applying the trapezoidal rule to the integral from Example 5.1 yields similar 
results to those from the Riemann rule because the integrand is nearly zero at the 
endpoints of the integration range. For large numbers of subintervals, the approximation 
is somewhat better. The results are shown in Table 5.3. 

TABLE 5.3 Estimates of the integral in (5.7) using the trapezoidal rule with 
various numbers of subintervals. All estimates are multiplied by a factor of $10^5$. 
Errors for use in a relative convergence criterion are given in the final column. 

Subintervals    Estimate             Relative Error 
       2        3.49387751694744 
       4        1.88760652713768     -0.46 
       8        1.72890177778965     -0.084 
      16        1.72888958437616     -0.0000071 
      32        1.72888994452869      0.00000021 
      64        1.72889004706156      0.000000059 
     128        1.72889007362057      0.000000015 
     256        1.72889008032079      0.0000000039 
     512        1.72889008199967      0.00000000097 
    1024        1.72889008241962      0.00000000024 


Suppose $f$ has two continuous derivatives. Problem 5.1 asks you to show that

\[ p_i(x) = f(x_i) + f'(x_i)(x - x_i) + \tfrac{1}{2} f''(x_i)(x_{i+1} - x_i)(x - x_i) + O(n^{-3}). \tag{5.14} \]

Subtracting the Taylor expansion of $f$ about $x_i$ from (5.14) yields

\[ p_i(x) - f(x) = \tfrac{1}{2} f''(x_i)(x - x_i)(x_{i+1} - x) + O(n^{-3}), \tag{5.15} \]

and integrating (5.15) over $[x_i, x_{i+1}]$ shows the approximation error of the trapezoidal
rule on the ith subinterval to be $h^3 f''(x_i)/12 + O(n^{-4})$. Thus

\begin{align*}
T(n) - \int_a^b f(x)\,dx &= \sum_{i=1}^{n} \left( \frac{h^3 f''(x_i)}{12} + O(n^{-4}) \right) \tag{5.16} \\
&= \frac{n h^3 f''(\xi)}{12} + O(n^{-3}) \tag{5.17} \\
&= \frac{(b - a)^3 f''(\xi)}{12 n^2} + O(n^{-3}) \tag{5.18}
\end{align*}

for some $\xi \in [a, b]$ by the mean value theorem for integrals. Hence the leading term
of the overall error is $O(n^{-2})$.
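The rule (5.13) and the $O(n^{-2})$ error rate can be checked numerically. The following is a minimal Python sketch, not the authors' code; the function name `trapezoid` and the test integrands are our own illustrative choices.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule T(n) from (5.13) with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

# The rule is exact when f is a line: the integral of 2x + 1 over [0, 3] is 12.
line_estimate = trapezoid(lambda x: 2 * x + 1, 0.0, 3.0, 4)

# For a curved integrand, doubling n should cut the error by a factor near 4.
exact = 1.0 - math.cos(1.0)  # integral of sin(x) over [0, 1]
err_8 = abs(trapezoid(math.sin, 0.0, 1.0, 8) - exact)
err_16 = abs(trapezoid(math.sin, 0.0, 1.0, 16) - exact)
error_ratio = err_8 / err_16
```

An error ratio near 4 when n doubles is what (5.18) predicts for a smooth integrand.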


5.1.3 Simpson’s Rule 


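Letting $m = 2$, $x_{i0} = x_i$, $x_{i1} = (x_i + x_{i+1})/2$, and $x_{i2} = x_{i+1}$ in (5.8), we obtain
Simpson’s rule. Problem 5.2 asks you to show that $A_{i0} = A_{i2} = (x_{i+1} - x_i)/6$ and
$A_{i1} = 2(A_{i0} + A_{i2})$. This yields the approximation

\[ \int_{x_i}^{x_{i+1}} f(x)\,dx \approx \frac{x_{i+1} - x_i}{6} \left[ f(x_i) + 4 f\!\left( \frac{x_i + x_{i+1}}{2} \right) + f(x_{i+1}) \right] \tag{5.19} \]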
TABLE 5.4 Estimates of the integral in (5.7) using Simpson’s rule with various
numbers of subintervals (and two nodes per subinterval). All estimates are
multiplied by a factor of 10^5. Errors for use in a relative convergence criterion
are given in the final column.

Subintervals   Estimate            Relative Error
2              1.35218286386776
4              1.67600019467364    0.24
8              1.72888551990500    0.032
16             1.72889006457954    0.0000026
32             1.72889008123918    0.0000000096
64             1.72889008247358    0.00000000071
128            1.72889008255419    0.000000000047
256            1.72889008255929    0.0000000000029
512            1.72889008255961    0.00000000000018
1024           1.72889008255963    0.000000000000014


for the (i + 1)th subinterval. Figure 5.2 shows how Simpson’s rule provides a quadratic
approximation to f on each subinterval.

Suppose the interval [a, b] has been partitioned into n subintervals of equal
length h = (b - a)/n, where n is even. To apply Simpson’s rule, we need an interior
node in each $[x_i, x_{i+1}]$. Since n is even, we may adjoin pairs of adjacent subintervals,
with the shared endpoint serving as the interior node of the larger interval. This
provides n/2 subintervals of length 2h, for which

\[ \int_a^b f(x)\,dx \approx \frac{h}{3} \sum_{i=1}^{n/2} \bigl( f(x_{2i-2}) + 4 f(x_{2i-1}) + f(x_{2i}) \bigr) = S\!\left( \tfrac{n}{2} \right). \tag{5.20} \]


Example 5.3 (Alzheimer’s Disease, Continued) Table 5.4 shows the results of 
applying Simpson’s rule to the integral from Example 5.1. One endpoint and an 
interior node were evaluated on each subinterval. Thus, for a fixed number of subin- 
tervals, Simpson’s rule requires twice as many evaluations of f as the Riemann or the 
trapezoidal rule. Following this example, we show that the precision of Simpson’s 
rule more than compensates for the increased evaluations. From another perspective, 
if the number of evaluations of f is fixed at n for each method, we would expect 
Simpson’s rule to outperform the previous approaches, if n is large enough. 


If f is quadratic on [a, b], then it is quadratic on each subinterval. Simpson’s 
rule approximates f on each subinterval by a second-degree polynomial that matches 
f at three points; therefore the polynomial is exactly f. Thus, Simpson’s rule exactly 
integrates quadratic f. 

Suppose f is smooth (but not polynomial) and we have n subintervals
$[x_i, x_{i+1}]$ of equal length 2h. To assess the degree of approximation in Simpson’s
rule, we begin with consideration on a single subinterval, and denote the simple
Simpson’s rule on that subinterval as $S_i(n) = (h/3)\,[f(x_i) + 4 f(x_i + h) + f(x_i + 2h)]$.
Denote the true value of the integral on that subinterval as $I_i$.

We use the Taylor series expansion of f about $x_i$, evaluated at $x = x_i + h$ and
$x = x_i + 2h$, to replace terms in $S_i(n)$. Combining terms, this yields

\[ S_i(n) = 2h f(x_i) + 2h^2 f'(x_i) + \tfrac{4}{3} h^3 f''(x_i) + \tfrac{2}{3} h^4 f'''(x_i) + \tfrac{5}{18} h^5 f''''(x_i) + \cdots. \tag{5.21} \]

Now let $F(x) = \int_{x_i}^{x} f(t)\,dt$. This function has the useful properties that
$F(x_i) = 0$, $F(x_i + 2h) = I_i$, and $F'(x) = f(x)$. Taylor series expansion of F about
$x_i$, evaluated at $x = x_i + 2h$, yields

\[ I_i = 2h f(x_i) + 2h^2 f'(x_i) + \tfrac{4}{3} h^3 f''(x_i) + \tfrac{2}{3} h^4 f'''(x_i) + \tfrac{4}{15} h^5 f''''(x_i) + \cdots. \tag{5.22} \]

Subtracting (5.22) from (5.21) yields $S_i(n) - I_i = h^5 f''''(x_i)/90 + \cdots = O(n^{-5})$.
This is the error for Simpson’s rule on a single subinterval. Over the n
subintervals that partition [a, b], the error is therefore the sum of such errors, namely
$O(n^{-4})$. Note that Simpson’s rule therefore exactly integrates cubic functions, too.
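The composite rule (5.20), its exactness for cubics, and its $O(n^{-4})$ error can be illustrated with a short Python sketch. The function name `simpson` and the test integrands are our own choices for illustration.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule as in (5.20); n is the (even) number of subintervals."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = 0.0
    for i in range(1, n // 2 + 1):
        x_left = a + (2 * i - 2) * h  # left end of the ith pair of subintervals
        total += f(x_left) + 4 * f(x_left + h) + f(x_left + 2 * h)
    return h / 3 * total

# A cubic is integrated exactly: the integral of x^3 - 2x over [0, 2] is 0.
cubic_estimate = simpson(lambda x: x**3 - 2 * x, 0.0, 2.0, 2)

# For a smooth non-polynomial integrand, doubling n should cut the error by about 16.
exact = 1.0 - math.cos(1.0)  # integral of sin(x) over [0, 1]
err_8 = abs(simpson(math.sin, 0.0, 1.0, 8) - exact)
err_16 = abs(simpson(math.sin, 0.0, 1.0, 16) - exact)
error_ratio = err_8 / err_16
```

This direct transcription of (5.20) evaluates shared endpoints twice; an implementation concerned with cost would reuse them.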


5.1.4 General kth-Degree Rule 


The preceding discussion raises the general question about how to determine a
Newton-Côtes rule that is exact for polynomials of degree k. This would require
constants $c_0, \ldots, c_k$ that satisfy

\[ \int_a^b f(x)\,dx = c_0 f(a) + c_1 f\!\left( a + \frac{b - a}{k} \right) + \cdots + c_i f\!\left( a + \frac{i(b - a)}{k} \right) + \cdots + c_k f(b) \tag{5.23} \]

for any polynomial f of degree at most k. Of course, one could follow the derivations
shown above for m = k, but there is another simple approach. If a method is to be exact
for all polynomials of degree k, then it must be exact for particular (and easily
integrated) choices like $1, x, x^2, \ldots, x^k$. Thus, we may set up a system of k + 1
equations in the k + 1 unknowns $c_0, \ldots, c_k$ as follows:

\begin{align*}
\int_a^b 1\,dx = b - a &= c_0 + c_1 + \cdots + c_k, \\
\int_a^b x\,dx = \frac{b^2 - a^2}{2} &= c_0 a + c_1 \left( a + \frac{b - a}{k} \right) + \cdots + c_k b, \\
&\;\;\vdots \\
\int_a^b x^k\,dx &= \text{etc.}
\end{align*}

All that remains is to solve for the $c_i$ to derive the algorithm. This approach is
sometimes called the method of undetermined coefficients.
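As a concrete illustration of the method of undetermined coefficients, the sketch below sets up and solves the moment system in exact rational arithmetic. The function name `newton_cotes_weights` and the use of Gauss-Jordan elimination are our own choices; any linear solver would do.

```python
from fractions import Fraction

def newton_cotes_weights(k, a=0, b=1):
    """Solve the (k + 1) x (k + 1) moment system for c_0, ..., c_k on [a, b]."""
    a, b = Fraction(a), Fraction(b)
    nodes = [a + i * (b - a) / k for i in range(k + 1)]
    # Row r demands exactness for x**r:
    #   sum_i c_i * nodes[i]**r = (b**(r+1) - a**(r+1)) / (r + 1).
    rows = [[x**r for x in nodes] + [(b**(r + 1) - a**(r + 1)) / (r + 1)]
            for r in range(k + 1)]
    # Gauss-Jordan elimination on the augmented matrix.
    for col in range(k + 1):
        pivot = next(r for r in range(col, k + 1) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(k + 1):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

trapezoid_weights = newton_cotes_weights(1)  # k = 1 on [0, 1]
simpson_weights = newton_cotes_weights(2)    # k = 2 on [0, 1]
```

On [0, 1], k = 1 returns the trapezoidal weights 1/2, 1/2 and k = 2 returns the Simpson weights 1/6, 2/3, 1/6, in agreement with the rules derived earlier.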
In general, low-degree Newton-Côtes methods are slow to converge. However, there
is a very efficient mechanism to improve upon a sequence of trapezoidal rule estimates.
Let T(n) denote the trapezoidal rule estimate of $\int_a^b f(x)\,dx$ using n subintervals of
equal length h = (b - a)/n, as given in (5.13). Without loss of generality, suppose
a = 0 and b = 1. Then


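\begin{align*}
T(1) &= \tfrac{1}{2} f(0) + \tfrac{1}{2} f(1), \\
T(2) &= \tfrac{1}{4} f(0) + \tfrac{1}{2} f\!\left(\tfrac{1}{2}\right) + \tfrac{1}{4} f(1), \\
T(4) &= \tfrac{1}{8} f(0) + \tfrac{1}{4} \left[ f\!\left(\tfrac{1}{4}\right) + f\!\left(\tfrac{1}{2}\right) + f\!\left(\tfrac{3}{4}\right) \right] + \tfrac{1}{8} f(1), \tag{5.24}
\end{align*}

and so forth. Noting that

\begin{align*}
T(2) &= \tfrac{1}{2} T(1) + \tfrac{1}{2} f\!\left(\tfrac{1}{2}\right), \\
T(4) &= \tfrac{1}{2} T(2) + \tfrac{1}{4} \left[ f\!\left(\tfrac{1}{4}\right) + f\!\left(\tfrac{3}{4}\right) \right], \tag{5.25}
\end{align*}

and so forth suggests the general recursion relationship

\[ T(2n) = \frac{1}{2} T(n) + \frac{h}{2} \sum_{i=1}^{n} f\!\left( a + \left( i - \tfrac{1}{2} \right) h \right). \tag{5.26} \]
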
The Euler-Maclaurin formula (1.8) can be used to show that

\[ T(n) = \int_a^b f(x)\,dx + c_1 h^2 + O(n^{-4}) \tag{5.27} \]

for some constant $c_1$, and hence

\[ T(2n) = \int_a^b f(x)\,dx + \frac{c_1}{4} h^2 + O(n^{-4}). \tag{5.28} \]

Therefore,

\[ \frac{4 T(2n) - T(n)}{3} = \int_a^b f(x)\,dx + O(n^{-4}), \tag{5.29} \]

so the $h^2$ error terms in (5.27) and (5.28) cancel. With this simple adjustment we have
made a striking improvement in the estimate. In fact, the estimate given in (5.29) turns
out to be Simpson’s rule with subintervals of width h/2. Moreover, this strategy can
be iterated for even greater gains.
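Begin by defining $T_{i,0} = T(2^i)$ for i = 0, ..., m. Then define a triangular array
of estimates like

\[ \begin{array}{lllll}
T_{0,0} \\
T_{1,0} & T_{1,1} \\
T_{2,0} & T_{2,1} & T_{2,2} \\
\vdots & \vdots & \vdots & \ddots \\
T_{m,0} & T_{m,1} & T_{m,2} & \cdots & T_{m,m}
\end{array} \]

using the relationship

\[ T_{i,j} = \frac{4^j\, T_{i,j-1} - T_{i-1,j-1}}{4^j - 1} \tag{5.30} \]

for j = 1, ..., i and i = 1, ..., m. Note that (5.30) can be reexpressed to calculate $T_{i,j}$
by adding an increment equal to $1/(4^j - 1)$ times the difference $T_{i,j-1} - T_{i-1,j-1}$
to the estimate given by $T_{i,j-1}$.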

If f has 2m continuous derivatives on [a, b], then the entries in the mth row of
the array have error $T_{m,j} - \int_a^b f(x)\,dx = O(2^{-2mj})$ for $j \le m$ [121, 376]. This is
such fast convergence that very small m will often suffice.
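The construction of the triangular array can be sketched in a few lines of Python. This is an illustrative implementation under our own naming (`romberg`, `T`), generating the array one row at a time with (5.26) for the first column and (5.30) for the remaining columns.

```python
import math

def romberg(f, a, b, m):
    """Return the triangular array T[i][j], 0 <= j <= i <= m."""
    T = [[0.0] * (i + 1) for i in range(m + 1)]
    n, h = 1, b - a  # current number of subintervals and their width
    T[0][0] = 0.5 * h * (f(a) + f(b))
    for i in range(1, m + 1):
        # (5.26): refine the trapezoidal estimate using only the new midpoints.
        midpoints = sum(f(a + (k - 0.5) * h) for k in range(1, n + 1))
        T[i][0] = 0.5 * T[i - 1][0] + 0.5 * h * midpoints
        n, h = 2 * n, h / 2
        for j in range(1, i + 1):
            # (5.30): extrapolate from the two neighbors in the previous column.
            T[i][j] = (4**j * T[i][j - 1] - T[i - 1][j - 1]) / (4**j - 1)
    return T

T = romberg(math.exp, 0.0, 1.0, 4)
exact = math.e - 1.0
```

Here the trapezoidal estimate with 16 subintervals, `T[4][0]`, is far less accurate than the extrapolated entry `T[4][4]`, even though both use the same 17 evaluations of f.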

It is important to check that the Romberg calculations do not deteriorate as m 
is increased. To do this, consider the quotient 


\[ Q_{ij} = \frac{T_{i,j} - T_{i-1,j}}{T_{i+1,j} - T_{i,j}}. \tag{5.31} \]

The error in $T_{i,j}$ is attributable partially to the approximation strategy itself and
partially to numerical imprecision introduced by computer roundoff. As long as the
former source dominates, the $Q_{ij}$ values should approach $4^{j+1}$ as i increases.
However, when computer roundoff error is substantial relative to approximation error, the
$Q_{ij}$ values will become erratic. The columns of the triangular array of $T_{i,j}$ can be
examined to determine the largest j for which the quotients appear to approach $4^{j+1}$
before deteriorating. No further column should be used to calculate an update via
(5.30). The following example illustrates the approach.


Example 5.4 (Alzheimer’s Disease, Continued) Table 5.5 shows the results of
applying Romberg integration to the integral from Example 5.1. The right columns of
this table are used to diagnose the stability of the Romberg calculations. The top
portion of the table corresponds to j = 0, and the $T_{i,0}$ are the trapezoidal rule estimates
given in Table 5.3. After some initial steps, the quotients in the top portion of the table
converge nicely to 4. Therefore, it is safe and advisable to apply (5.30) to generate a
second column of the triangular array. It is safe because the convergence of the
quotients to 4 implies that computer roundoff error has not yet become a dominant source
TABLE 5.5 Estimates of the integral in (5.7) using Romberg integration. All estimates and differences
are multiplied by a factor of 10^5. The final two columns provide performance evaluation measures
discussed in the text.

i    j   Subintervals   T_{i,j}             T_{i,j} - T_{i-1,j}    Q_{ij}
1    0   2              3.49387751694744
2    0   4              1.88760652713768    -1.60627098980976
3    0   8              1.72890177778965    -0.15870474934803      10.12
4    0   16             1.72888958437616    -0.00001219341349      13015.61
5    0   32             1.72888994452869    0.00000036015254       -33.86
6    0   64             1.72889004706156    0.00000010253287       3.51
7    0   128            1.72889007362057    0.00000002655901       3.86
8    0   256            1.72889008032079    0.00000000670022       3.96
9    0   512            1.72889008199967    0.00000000167888       3.99
10   0   1024           1.72889008241962    0.00000000041996       4.00
1    1   2
2    1   4              1.35218286386776
3    1   8              1.67600019467364    0.32381733080589
4    1   16             1.72888551990500    0.05288532523136       6.12
5    1   32             1.72889006457954    0.00000454467454       11636.77
6    1   64             1.72889008123918    0.00000001665964       272.80
7    1   128            1.72889008247358    0.00000000123439       13.50
8    1   256            1.72889008255420    0.00000000008062       15.31
9    1   512            1.72889008255929    0.00000000000510       15.82
10   1   1024           1.72889008255961    0.00000000000032       16.14
1    2   2
2    2   4
3    2   8              1.69758801672736
4    2   16             1.73241120825375    0.03482319152639
5    2   32             1.72889036755784    -0.00352084069591      -9.89
6    2   64             1.72889008234983    -0.00000028520802      12344.82
7    2   128            1.72889008255587    0.00000000020604       -1384.21
8    2   256            1.72889008255957    0.00000000000370       55.66
9    2   512            1.72889008255963    0.00000000000006       59.38
10   2   1024           1.72889008255963    <0.00000000000001      -20.44


of error. It is advisable because incrementing one of the current integral estimates by
one-third of the corresponding difference would yield a noticeably different updated
estimate.

The second column of the triangular array is shown in the middle portion of
Table 5.5. The quotients in this portion also appear reasonable, so the third column is
calculated and shown in the bottom portion of the table. The values of $Q_{i2}$ approach
64, allowing more tolerance for larger j. At i = 10, computer roundoff error appears
to dominate approximation error because the quotient departs from near 64. However,
note that incrementing the integral estimate by 1/63 of the difference at this point would
have a negligible impact on the updated estimate itself. Had we proceeded one more
step with the reasoning that the growing amount of roundoff error will cause little harm
at this point, we would have found that the estimate was not improved and the resulting
quotients clearly indicated that no further extrapolations should be considered.

Thus, we may take $T_{9,2} = 1.72889008255963 \times 10^{-5}$ to be the estimated value
of the integral. In this example, we calculated the triangular array one column at a
time, for m = 10. However, in implementation it makes more sense to generate the
array one row at a time. In this case, we would have stopped after i = 9, obtaining a
precise estimate with fewer subintervals (and fewer evaluations of f) than in any
of the previous examples.


The Romberg strategy can be applied to other Newton-Côtes integration rules.
For example, if S(n) is the Simpson’s rule estimate of $\int_a^b f(x)\,dx$ using n subintervals
of equal length, then the analogous result to (5.29) is

\[ \frac{16 S(2n) - S(n)}{15} = \int_a^b f(x)\,dx + O(n^{-6}). \tag{5.32} \]


Romberg integration is a form of a more general strategy called Richardson 
extrapolation [325, 516]. 


5.3 GAUSSIAN QUADRATURE 


All the Newton-Côtes rules discussed above are based on subintervals of equal length.
The estimated integral is a sum of weighted evaluations of the integrand on a regular 
grid of points. For a fixed number of subintervals and nodes, only the weights may 
be flexibly chosen; we have limited attention to choices of weights that yield exact 
integration of polynomials. Using m + 1 nodes per subinterval allowed mth-degree 
polynomials to be integrated exactly. 

An important question is the amount of improvement that can be achieved if the
constraint of evenly spaced nodes and subintervals is removed. By allowing both the
weights and the nodes to be freely chosen, we have twice as many parameters to use in
the approximation of f. If we consider that the value of an integral is predominantly
determined by regions where the magnitude of the integrand is large, then it makes
sense to put more nodes in such regions. With a suitably flexible choice of m + 1 nodes,
$x_0, \ldots, x_m$, and corresponding weights, $A_0, \ldots, A_m$, exact integration of polynomials
of degree up to 2m + 1 can be obtained using $\int_a^b f(x)\,dx = \sum_{i=0}^{m} A_i f(x_i)$.

This approach, called Gaussian quadrature, can be extremely effective for
integrals like $\int_a^b f(x) w(x)\,dx$ where w is a nonnegative function and $\int_a^b x^k w(x)\,dx < \infty$
for all $k \ge 0$. These requirements are reminiscent of a density function with finite
moments. Indeed, it is often useful to think of w as a density, in which case
integrals like expected values and Bayesian posterior normalizing constants are natural
candidates for Gaussian quadrature. The method is more generally applicable,
however, by defining $f^*(x) = f(x)/w(x)$ and applying the method to $\int_a^b f^*(x) w(x)\,dx$.

The best node locations turn out to be the roots of a set of orthogonal polynomials 
that is determined by w. 


5.3.1 Orthogonal Polynomials 


Some background on orthogonal polynomials is needed to develop Gaussian
quadrature methods [2, 139, 395, 620]. Let $p_k(x)$ denote a generic polynomial of degree
k. For convenience in what follows, assume that the leading coefficient of $p_k(x)$ is
positive.

If $\int_a^b f(x)^2 w(x)\,dx < \infty$, then the function f is said to be square-integrable
with respect to w on [a, b]. In this case we will write $f \in L^2_{w,[a,b]}$. For any f and g
in $L^2_{w,[a,b]}$, their inner product with respect to w on [a, b] is defined to be

\[ \langle f, g \rangle_{w,[a,b]} = \int_a^b f(x) g(x) w(x)\,dx. \tag{5.33} \]

If $\langle f, g \rangle_{w,[a,b]} = 0$, then f and g are said to be orthogonal with respect to w on [a, b].
If also f and g are scaled so that $\langle f, f \rangle_{w,[a,b]} = \langle g, g \rangle_{w,[a,b]} = 1$, then f and g are
orthonormal with respect to w on [a, b].

Given any w that is nonnegative on [a, b], there exists a sequence of polynomials
$\{p_k(x)\}_{k=0}^{\infty}$ that are orthogonal with respect to w on [a, b]. This sequence is not unique
without some form of standardization because $\langle f, g \rangle_{w,[a,b]} = 0$ implies $\langle cf, g \rangle_{w,[a,b]} = 0$
for any constant c. The canonical standardization for a set of orthogonal polynomials
depends on w and will be discussed later; a common choice is to set the leading
coefficient of $p_k(x)$ equal to 1. For use in Gaussian quadrature, the range of integration
is also customarily transformed from [a, b] to a range [a*, b*] whose choice depends
on w.

A set of standardized, orthogonal polynomials can be summarized by a
recurrence relation

\[ p_k(x) = (\alpha_k + x \beta_k)\, p_{k-1}(x) - \gamma_k\, p_{k-2}(x) \tag{5.34} \]

for appropriate choices of $\alpha_k$, $\beta_k$, and $\gamma_k$ that vary with k and w.

The roots of any polynomial in such a standardized set are all in (a*, b*). 
These roots will serve as nodes for Gaussian quadrature. Table 5.6 lists several sets of 
orthogonal polynomials, their standardizations, and their correspondences to common 
density functions. 


5.3.2 The Gaussian Quadrature Rule 


Standardized orthogonal polynomials like (5.34) are important because they
determine both the weights and the nodes for a Gaussian quadrature rule based on a chosen
w. Let $\{p_k(x)\}_{k=0}^{\infty}$ be a sequence of orthonormal polynomials with respect to w on
[a, b] for a function w that meets the conditions previously discussed. Denote the
TABLE 5.6 Orthogonal polynomials, their standardizations, their correspondence to common
density functions, and the terms used for their recursive generation. The leading coefficient of
a polynomial is denoted c_k. In some cases, variants of standard definitions are chosen for best
correspondence with familiar densities.

Name (Density)           w(x)                     Standardization     (a*, b*)     alpha_k          beta_k       gamma_k
Jacobi^a (Beta)          (1 - x)^(p-q) x^(q-1)    c_k = 1             (0, 1)       See [2, 516]
Legendre^a (Uniform)     1                        p_k(1) = 1          (0, 1)       (1 - 2k)/k       (4k - 2)/k   (k - 1)/k
Laguerre (Exponential)   exp{-x}                  c_k = (-1)^k/k!     (0, ∞)       (2k - 1)/k       -1/k         (k - 1)/k
Laguerre^b (Gamma)       x^r exp{-x}              c_k = (-1)^k/k!     (0, ∞)       (2k - 1 + r)/k   -1/k         (k - 1 + r)/k
Hermite^c (Normal)       exp{-x^2/2}              c_k = 1             (-∞, ∞)      0                1            k - 1

a Shifted.
b Generalized.
c Alternative form.
roots of $p_{m+1}(x)$ by $a < x_0 < \cdots < x_m < b$. Then there exist weights $A_0, \ldots, A_m$
such that:

1. $A_i > 0$ for i = 0, ..., m.

2. $A_i = -c_{m+2} / [c_{m+1}\, p_{m+2}(x_i)\, p'_{m+1}(x_i)]$, where $c_k$ is the leading coefficient
of $p_k(x)$.

3. $\int_a^b f(x) w(x)\,dx = \sum_{i=0}^{m} A_i f(x_i)$ whenever f is a polynomial of degree not
exceeding 2m + 1. In other words, the method is exact for the expectation of
any such polynomial with respect to w.

4. If f is 2(m + 1) times continuously differentiable, then

\[ \int_a^b f(x) w(x)\,dx - \sum_{i=0}^{m} A_i f(x_i) = \frac{f^{(2m+2)}(\xi)}{(2m + 2)!\, c_{m+1}^2} \tag{5.35} \]

for some $\xi \in (a, b)$.

The proof of this result may be found in [139].
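The exactness property in item 3 can be checked numerically for w(x) = 1 on [-1, 1]. The sketch below hard-codes the standard 10-point Gauss-Legendre nodes and weights (compare Table 5.8), so m = 9 and polynomials of degree up to 19 should be integrated exactly up to rounding. The function name `gauss_legendre_10` and the linear rescaling to a general [a, b] are our own illustrative choices.

```python
import math

# Positive nodes and matching weights of the 10-point Gauss-Legendre rule on [-1, 1];
# the rule is symmetric, so the negative nodes share the same weights.
HALF_NODES = [0.148874338981631, 0.433395394129247, 0.679409568299024,
              0.865063366688985, 0.973906528517172]
HALF_WEIGHTS = [0.295524224714753, 0.269266719309996, 0.219086362515982,
                0.149451349150581, 0.066671344308688]
NODES = [-x for x in HALF_NODES] + HALF_NODES
WEIGHTS = HALF_WEIGHTS + HALF_WEIGHTS

def gauss_legendre_10(f, a=-1.0, b=1.0):
    """Approximate the integral of f over [a, b] with the 10-point rule."""
    half, mid = 0.5 * (b - a), 0.5 * (a + b)  # map [-1, 1] linearly onto [a, b]
    return half * sum(w * f(mid + half * x) for x, w in zip(NODES, WEIGHTS))

degree_18 = gauss_legendre_10(lambda x: x**18)  # exact value 2/19
degree_20 = gauss_legendre_10(lambda x: x**20)  # exact value 2/21, beyond degree 19
smooth = gauss_legendre_10(math.exp, 0.0, 1.0)  # exact value e - 1
```

The degree-18 monomial is recovered to roughly the precision of the tabulated nodes, whereas the degree-20 monomial shows a small but clearly nonzero error, as (5.35) suggests it should.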
TABLE 5.7 Estimates of the integral in (5.7) using Gauss-Hermite quadrature
with various numbers of nodes. All estimates are multiplied by a factor of 10^5.
Errors for use in a relative convergence criterion are given in the final column.

Nodes   Estimate            Relative Error
2       1.72893306163335
3       1.72889399083898    -0.000023
4       1.72889068827101    -0.0000019
5       1.72889070910131    0.000000012
6       1.72889070914313    0.000000000024
7       1.72889070914166    -0.00000000000085
8       1.72889070914167    -0.0000000000000071


Although this result and Table 5.6 provide the means by which the nodes and 
weights for an (m + 1)-point Gaussian quadrature rule can be calculated, one should 
be hesitant to derive these directly, due to potential numerical imprecision. Numeri- 
cally stable calculations of these quantities can be obtained from publicly available 
software [228, 489]. Alternatively, one can draw the nodes and weights from pub- 
lished tables like those in [2, 387]. Lists of other published tables are given in [139, 
630]. 

Of the choices in Table 5.6, Gauss-Hermite quadrature is particularly useful
because it enables integration over the entire real line. The prominence of normal
distributions in statistical practice and limiting theory means that many integrals
resemble the product of a smooth function and a normal density; the usefulness of
Gauss-Hermite quadrature in Bayesian applications is demonstrated in [478].
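To make the idea concrete, here is a minimal Python sketch of a three-point Gauss-Hermite rule for w(x) = exp{-x^2/2}. Its nodes are the roots of x^3 - 3x, and its weights are sqrt(2π) times (1/6, 2/3, 1/6). The function name `normal_expectation` and the location-scale change of variable are our own illustrative additions.

```python
import math

# Three-point rule for w(x) = exp(-x^2 / 2): nodes are the roots of x^3 - 3x.
NODES = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
WEIGHTS = [math.sqrt(2.0 * math.pi) * p for p in (1 / 6, 2 / 3, 1 / 6)]

def normal_expectation(g, mu=0.0, sigma=1.0):
    """Approximate E{g(X)} for X ~ N(mu, sigma^2) with the three-point rule."""
    total = sum(w * g(mu + sigma * x) for x, w in zip(NODES, WEIGHTS))
    return total / math.sqrt(2.0 * math.pi)  # divide by the integral of w

second_moment = normal_expectation(lambda x: x**2)  # exact: 1
fourth_moment = normal_expectation(lambda x: x**4)  # exact: 3
sixth_moment = normal_expectation(lambda x: x**6)   # exact: 15, but degree 6 > 5
```

With three nodes (m = 2) the rule is exact through degree 2m + 1 = 5, so the second and fourth moments are reproduced, while the sixth moment is not (the rule returns 9 rather than 15).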


Example 5.5 (Alzheimer’s Disease, Continued) Table 5.7 shows the results of
applying Gauss-Hermite quadrature to estimate the integral from Example 5.1. Using
the Hermite polynomials in this case is particularly appealing because the integrand
from Example 5.1 really should be integrated over the entire real line, rather than the
interval (-0.07, 0.085). Convergence was extremely fast: With 8 nodes we obtained
a relative error half the magnitude of that achieved by Simpson’s rule with 1024
nodes. The estimate in Table 5.7 differs from previous examples because the range
of integration differs. Applying Gauss-Legendre quadrature to estimate the integral
over the interval (-0.07, 0.085) yields an estimate of $1.72889008255962 \times 10^{-5}$ using
26 nodes.


Gaussian quadrature is quite different from the Newton-Côtes rules discussed
previously. Whereas the latter rely on potentially enormous numbers of nodes to
achieve sufficient precision, Gaussian quadrature is often very precise with a
remarkably small number of nodes. However, for Gaussian quadrature the nodes for
an m-point rule are not usually shared by an (m + k)-point rule for $k \ge 1$. Recall
the strategy discussed for Newton-Côtes rules where the number of subintervals is
sequentially doubled so that half the new nodes correspond to old nodes. This is not
effective for Gaussian quadrature because each increase in the number of nodes will
require a separate effort to generate the nodes and the weights.


5.4 FREQUENTLY ENCOUNTERED PROBLEMS 


This section briefly addresses strategies to try when you are faced with a problem more 
complex than a one-dimensional integral of a smooth function with no singularities 
on a finite range. 


5.4.1 Range of Integration 


Integrals over infinite ranges can be transformed to a finite range. Some useful trans- 
formations include 1/x, exp{x}/(1 + exp{x}), exp{—x}, and x/(1 + x). Any cumu- 
lative distribution function is a potential basis for transformation, too. For example, 
the exponential cumulative distribution function transforms the positive half line to 
the unit interval. Cumulative distribution functions for real-valued random variables 
transform doubly infinite ranges to the unit interval. Of course, transformations to 
remove an infinite range may introduce other types of problems such as singularities. 
Thus, among the options available, it is important to choose a good transformation. 
Roughly speaking, a good choice is one that produces an integrand that is as nearly 
constant as can be managed. 
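As a small illustration of transforming an infinite range, the sketch below uses u = x/(1 + x) to map the positive half line to the unit interval and then applies a midpoint rule, which never evaluates the transformed integrand at the endpoint u = 1. The function name `integrate_half_line`, the choice of midpoint rule, and the test integrand exp{-x^2} are our own assumptions for illustration.

```python
import math

def integrate_half_line(f, n=200):
    """Midpoint-rule estimate of the integral of f over (0, inf) via u = x / (1 + x)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h                # midpoints stay strictly inside (0, 1)
        x = u / (1.0 - u)                # inverse transformation
        total += f(x) / (1.0 - u) ** 2   # Jacobian: dx = du / (1 - u)^2
    return h * total

# The integral of exp(-x^2) over (0, inf) equals sqrt(pi) / 2.
estimate = integrate_half_line(lambda x: math.exp(-x * x))
reference = math.sqrt(math.pi) / 2.0
```

This transformation works well here because the transformed integrand vanishes smoothly as u approaches 1; for a heavier-tailed integrand the same transformation could introduce a singularity there.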

Infinite ranges can be dealt with in other ways, too. Example 5.5 illustrates
the use of Gauss-Hermite quadrature to integrate over the real line. Alternatively,
when the integrand vanishes near the extremes of the integration range,
integration can be truncated with a controllable amount of error. Truncation was used in
Example 5.1.

Further discussion of transformations and strategies for selecting a suitable one
are given in [139, 630].


5.4.2 Integrands with Singularities or Other Extreme Behavior 


Several strategies can be employed to eliminate or control the effect of singularities 
that would otherwise impair the performance of an integration rule. 

Transformation is one approach. For example, consider $\int_0^1 (\exp\{x\}/\sqrt{x})\,dx$,
which has a singularity at 0. The integral is easily fixed using the transformation
$u = \sqrt{x}$, yielding $2 \int_0^1 \exp\{u^2\}\,du$.

The integral $\int_0^1 x^{999} \exp\{x\}\,dx$ has no singularity on [0, 1] but is very difficult to
estimate directly with a Newton-Côtes approach. Transformation is helpful in such
cases, too. Letting $u = x^{1000}$, so that $x^{999}\,dx = du/1000$, yields
$\frac{1}{1000} \int_0^1 \exp\{u^{1/1000}\}\,du$, whose integrand is nearly
constant on [0, 1]. The transformed integral is much easier to estimate reliably.
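The benefit of removing the singularity in the first example can be seen with a simple midpoint rule applied before and after the change of variable. The helper `midpoint` and the choice n = 64 are ours. The reference value 2.92530349 is twice the standard value 1.46265175 of the integral of exp{u^2} over [0, 1].

```python
import math

def midpoint(f, a, b, n):
    """Composite midpoint rule; it never evaluates f at the endpoints."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

n = 64
# Original form: integrand exp(x) / sqrt(x) is unbounded near 0.
direct = midpoint(lambda x: math.exp(x) / math.sqrt(x), 0.0, 1.0, n)
# After u = sqrt(x): integrand 2 * exp(u^2) is smooth on [0, 1].
transformed = midpoint(lambda u: 2.0 * math.exp(u * u), 0.0, 1.0, n)

reference = 2.92530349
direct_error = abs(direct - reference)
transformed_error = abs(transformed - reference)
```

With the same 64 evaluations, the transformed integral is estimated far more accurately than the original, whose error decays only like the square root of the subinterval width.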

Another approach is to subtract out the singularity. For example, consider
integrating $\int_{-\pi/2}^{\pi/2} \log\{\sin^2 x\}\,dx$, which has a singularity at 0. By adding and subtracting
away the square of the log singularity at zero, we obtain $\int_{-\pi/2}^{\pi/2} \log\{(\sin^2 x)/x^2\}\,dx +
\int_{-\pi/2}^{\pi/2} \log\{x^2\}\,dx$. The first term is then suitable for quadrature, and elementary
methods can be used to derive that the second term equals $2\pi[\log(\pi/2) - 1]$.

Refer to [139, 516, 630] for more detailed discussions of how to formulate an 
appropriate strategy to address singularities. 


5.4.3 Multiple Integrals 


The most obvious extensions of univariate quadrature techniques to multiple
integrals are product formulas. This entails, for example, writing $\int_a^b \int_c^d f(x, y)\,dy\,dx$ as
$\int_a^b g(x)\,dx$ where $g(x) = \int_c^d f(x, y)\,dy$. Values of g(x) could be obtained via univariate
quadrature approximations to $\int_c^d f(x, y)\,dy$ for a grid of x values. Univariate
quadrature could then be completed for g. Using n subintervals in each univariate quadrature
would require $n^p$ evaluations of f, where p is the dimension of the integral. Thus,
this approach is not feasible for large p. Even for small p, care must be taken to avoid 
the accumulation of a large number of small errors, since each exterior integral de- 
pends on the values obtained for each interior integral at a set of points. Also, product 
formulas can only be implemented directly for regions of integration that have simple 
geometry, such as hyperrectangles. 
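A product formula over a rectangle can be sketched by nesting one univariate rule inside another. The example below nests Simpson's rule and counts evaluations of the integrand; the names `simpson`, `product_simpson`, and `integrand` are our own. Because this implementation evaluates n + 1 points per dimension, the count is (n + 1)^p, which has the same order of growth as the n^p cited above.

```python
def simpson(f, a, b, n):
    """Composite Simpson's rule with n (even) subintervals and n + 1 evaluations of f."""
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))

def product_simpson(f, ax, bx, ay, by, n):
    """Integrate f(x, y) over [ax, bx] x [ay, by] by nesting two univariate rules."""
    def g(x):
        # Inner integral over y for a fixed x on the outer grid.
        return simpson(lambda y: f(x, y), ay, by, n)
    return simpson(g, ax, bx, n)

calls = 0

def integrand(x, y):
    global calls
    calls += 1
    return x * y**2

n = 4
# The integral of x * y^2 over [0, 1] x [0, 2] is (1/2) * (8/3) = 4/3.
estimate = product_simpson(integrand, 0.0, 1.0, 0.0, 2.0, n)
```

Even in this two-dimensional case the count is already (n + 1)^2 = 25 evaluations for n = 4, which shows how quickly the cost grows with dimension.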

To cope with higher dimensions and general multivariate regions, one may 
develop specialized grids over the region of integration, search for one or more di- 
mensions that can be integrated analytically to reduce the complexity of the problem, 
or turn to multivariate adaptive quadrature techniques. Multivariate methods are dis- 
cussed in more detail in [139, 290, 516, 619]. 

Monte Carlo methods discussed in Chapters 6 and 7 can be employed to estimate
integrals over high-dimensional regions efficiently. For estimating a one-dimensional
integral based on n points, a Monte Carlo estimate will typically have a convergence
rate of $O(n^{-1/2})$, whereas the quadrature methods discussed in this chapter converge
at $O(n^{-2})$ or faster. In higher dimensions, however, the story changes. Quadrature
approaches are then much more difficult to implement and slower to converge, whereas
Monte Carlo approaches generally retain their implementation ease and their
convergence performance. Accordingly, Monte Carlo approaches are generally preferred for
high-dimensional integration.


5.4.4 Adaptive Quadrature 


The principle of adaptive quadrature is to choose subinterval lengths based on the local 
behavior of the integrand. For example, one may recursively subdivide those existing 
subintervals where the integral estimate has not yet stabilized. This can be a very 
effective approach if the bad behavior of the integrand is confined to a small portion 
of the region of integration. It also suggests a way to reduce the effort expended for 
multiple integrals because much of the integration region may be adequately covered 
by a very coarse grid of subintervals. A variety of ideas is covered in [121, 376, 630]. 
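One common way to realize this principle is recursive adaptive Simpson quadrature, sketched below. This is a generic textbook-style scheme rather than a method taken from the references above; the function name `adaptive_simpson`, the tolerance-splitting rule, and the `max_depth` safeguard are our own assumptions.

```python
import math

def adaptive_simpson(f, a, b, tol=1e-8, max_depth=40):
    """Recursively bisect subintervals whose Simpson estimate has not yet stabilized."""
    def simpson(fa, fm, fb, width):
        return width / 6.0 * (fa + 4.0 * fm + fb)

    def recurse(a, fa, b, fb, m, fm, whole, tol, depth):
        lm, rm = 0.5 * (a + m), 0.5 * (m + b)
        flm, frm = f(lm), f(rm)
        left = simpson(fa, flm, fm, m - a)
        right = simpson(fm, frm, fb, b - m)
        change = left + right - whole
        # Accept when halving the interval barely changes the estimate.
        if depth >= max_depth or abs(change) <= 15.0 * tol:
            return left + right + change / 15.0
        return (recurse(a, fa, m, fm, lm, flm, left, tol / 2.0, depth + 1)
                + recurse(m, fm, b, fb, rm, frm, right, tol / 2.0, depth + 1))

    m = 0.5 * (a + b)
    fa, fm, fb = f(a), f(m), f(b)
    return recurse(a, fa, b, fb, m, fm, simpson(fa, fm, fb, b - a), tol, 0)

# sqrt(x) has an unbounded derivative at 0, so refinement concentrates there.
root_estimate = adaptive_simpson(math.sqrt, 0.0, 1.0)  # exact value 2/3
smooth_estimate = adaptive_simpson(math.exp, 0.0, 1.0)  # exact value e - 1
```

For the square-root integrand, nearly all of the bisection happens near 0, while the rest of [0, 1] is covered by comparatively few wide subintervals.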
5.4.5 Software for Exact Integration


This chapter has focused on integrals that do not have analytic solutions. For most of 
us, there is a class of integrals that have analytic solutions that are so complex as to 
be beyond our skills, patience, or cleverness to derive. Numerical approximation will 
work for such integrals, but so will symbolic integration tools. Software packages 
such as Mathematica [671] and Maple [384] allow the user to type integrands in a 
syntax resembling many computer languages. The software interprets these algebraic 
expressions. With deft use of commands for integrating and manipulating terms, the 
user can derive exact expressions for analytic integrals. The software does the algebra. 
Such software is particularly helpful for difficult indefinite integrals. 


PROBLEMS 


5.1. For the trapezoidal rule, express $p_i(x)$ as

\[ f(x_i) + (x - x_i)\, \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}. \]

Expand f in Taylor series about $x_i$ and evaluate this at $x = x_{i+1}$. Use the resulting
expression in order to prove (5.14).


5.2. Following the approach in (5.8)—(5.11), derive Aj; for j = 0, 1, 2 for Simpson’s rule. 


5.3. Suppose the data $(x_1, \ldots, x_7)$ = (6.52, 8.32, 0.31, 2.82, 9.96, 0.14, 9.64) are observed.
Consider Bayesian estimation of $\mu$ based on a $N(\mu, 3^2/7)$ likelihood for the minimally
sufficient $\bar{x} \mid \mu$, and a Cauchy(5,2) prior.

a. Using a numerical integration method of your choice, show that the proportionality
constant is roughly 7.84654. (In other words, find k such that $\int k \times (\text{prior}) \times
(\text{likelihood})\,d\mu = 1$.)

b. Using the value 7.84654 from (a), determine the posterior probability that $2 \le \mu \le 8$
using the Riemann, trapezoidal, and Simpson’s rules over the range of integration
[implementing Simpson’s rule as in (5.20) by pairing adjacent subintervals]. Compute
the estimates until relative convergence within 0.0001 is achieved for the slowest
method. Table the results. How close are your estimates to the correct answer of
0.99605?

c. Find the posterior probability that $\mu \ge 3$ in the following two ways. Since the range
of integration is infinite, use the transformation $u = \exp\{\mu\}/(1 + \exp\{\mu\})$. First,
ignore the singularity at 1 and find the value of the integral using one or more
quadrature methods. Second, fix the singularity at 1 using one or more appropriate
strategies, and find the value of the integral. Compare your results. How close are
the estimates to the correct answer of 0.99086?

d. Use the transformation $u = 1/\mu$, and obtain a good estimate for the integral in
part (c).


5.4. Let X ~ Unif[1, a] and Y = (a - 1)/X, for a > 1. Compute E{Y} = log a using
Romberg’s algorithm for m = 6. Table the resulting triangular array. Comment on 
your results. 
TABLE 5.8 Nodes and weights for 10-point Gauss-Legendre
quadrature on the range [-1, 1].

±x_i                 A_i
0.148874338981631    0.295524224714753
0.433395394129247    0.269266719309996
0.679409568299024    0.219086362515982
0.865063366688985    0.149451349150581
0.973906528517172    0.066671344308688


5.5. The Gaussian quadrature rule having w(x) = 1 for integrals on [-1, 1] (cf. Table 5.6)
is called Gauss-Legendre quadrature because it relies on the Legendre polynomials.
The nodes and weights for the 10-point Gauss-Legendre rule are given in Table 5.8.

a. Plot the weights versus the nodes.

b. Find the area under the curve $y = x^2$ between -1 and 1. Compare this with the
exact answer and comment on the precision of this quadrature technique.


5.6. Suppose 10 i.i.d. observations result in $\bar{x} = 47$. Let the likelihood for $\mu$ correspond to
the model $\bar{X} \mid \mu \sim N(\mu, 50/10)$, and the prior for $(\mu - 50)/8$ be Student’s t with 1
degree of freedom.

a. Show that the five-point Gauss-Hermite quadrature rule relies on the Hermite
polynomial $H_5(x) = c(x^5 - 10x^3 + 15x)$.

b. Show that the normalization of $H_5(x)$ [namely, $\langle H_5(x), H_5(x) \rangle = 1$] requires
$c = 1/\sqrt{120\sqrt{2\pi}}$. You may wish to recall that a standard normal distribution has odd
moments equal to zero and rth moments equal to $r!/[(r/2)!\,2^{r/2}]$ when r is even.

c. Using your favorite root finder, estimate the nodes of the five-point Gauss-Hermite
quadrature rule. (Recall that finding a root of f is equivalent to finding a local
minimum of |f|.) Plot $H_5(x)$ from -3 to 3 and indicate the roots.

d. Find the quadrature weights. Plot the weights versus the nodes. You may appreciate
knowing that the normalizing constant for $H_6(x)$ is $1/\sqrt{720\sqrt{2\pi}}$.

e. Using the nodes and weights found above for five-point Gauss-Hermite integration,
estimate the posterior variance of $\mu$. (Remember to account for the normalizing
constant in the posterior before taking posterior expectations.)


CHAPTER 6 


SIMULATION AND MONTE CARLO 
INTEGRATION 


This chapter addresses the simulation of random draws $X_1, \ldots, X_n$ from a
target distribution f. The most frequent use of such draws is to perform Monte 
Carlo integration, which is the statistical estimation of the value of an integral using 
evaluations of an integrand at a set of points drawn randomly from a distribution with 
support over the range of integration [461]. 

Estimation of integrals via Monte Carlo simulation can be useful in a wide 
variety of settings. In Bayesian analyses, posterior moments can be written in the 
form of an integral but typically cannot be evaluated analytically. Posterior probabil- 
ities can also be written as the expectation of an indicator function with respect to 
the posterior. The calculation of risk in Bayesian decision theory relies on integra- 
tion. Integration is also an important component in frequentist likelihood analyses. 
For example, marginalization of a joint density relies upon integration. Example 5.1 
illustrates an integration problem arising from the maximum likelihood fit of a gen- 
eralized linear mixed model. A variety of other integration problems are discussed 
here and in Chapter 7. 

Aside from its application to Monte Carlo integration, simulation of random 
draws from a target density f is important in many other contexts. Indeed, Chapter 7 
is devoted to a specific strategy for Monte Carlo integration called Markov chain 
Monte Carlo. Bootstrap methods, stochastic search algorithms, and a wide variety of 
other statistical tools also rely on generation of random deviates. 

Further details about the topics discussed in this chapter can be found in [106, 
158, 190, 374, 383, 417, 432, 469, 539, 555, 557]. 


6.1 INTRODUCTION TO THE MONTE CARLO METHOD 


Many quantities of interest in inferential statistical analyses can be expressed as the
expectation of a function of a random variable, say E{h(X)}. Let f denote the density
of X, and $\mu$ denote the expectation of h(X) with respect to f. When an i.i.d. random
sample $X_1, \ldots, X_n$ is obtained from f, we can approximate $\mu$ by a sample average:

$$\hat{\mu}_{MC} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) \to \int h(x) f(x)\, dx = \mu \tag{6.1}$$

as $n \to \infty$, by the strong law of large numbers (see Section 1.6). Further, let
$v(x) = [h(x) - \mu]^2$, and assume that $h(X)^2$ has finite expectation under f. Then the sampling
variance of $\hat{\mu}_{MC}$ is $\sigma^2/n = E\{v(X)/n\}$, where the expectation is taken with respect
to f. A similar Monte Carlo approach can be used to estimate $\sigma^2$ by

$$\widehat{\mathrm{var}}\{\hat{\mu}_{MC}\} = \frac{1}{n-1} \sum_{i=1}^{n} \left[ h(X_i) - \hat{\mu}_{MC} \right]^2. \tag{6.2}$$


When $\sigma^2$ exists, the central limit theorem implies that $\hat{\mu}_{MC}$ has an approximate normal
distribution for large n, so approximate confidence bounds and statistical inference
for $\mu$ follow. Generally, it is straightforward to extend (6.1), (6.2), and most of the
methods in this chapter to cases when the quantity of interest is multivariate, so it
suffices hereafter to consider $\mu$ to be scalar.
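
The estimator in (6.1) and its estimated standard error can be sketched in a few lines. The following Python sketch is illustrative only: the function name `mc_estimate`, the choice $h(x) = x^2$, and the standard normal target are assumptions made for this example, not part of the text. The sample variance of the $h(X_i)$ estimates $\sigma^2$, and dividing by n gives the estimated sampling variance of the sample average.

```python
import math
import random

def mc_estimate(h, sampler, n, rng):
    """Return (mu_hat, se): the sample average in (6.1) and its estimated standard error."""
    values = [h(sampler(rng)) for _ in range(n)]
    mu_hat = sum(values) / n
    # The sample variance of the h(X_i), as in (6.2), estimates sigma^2.
    sigma2_hat = sum((v - mu_hat) ** 2 for v in values) / (n - 1)
    # The sampling variance of mu_hat is sigma^2 / n.
    return mu_hat, math.sqrt(sigma2_hat / n)

rng = random.Random(1)
# Hypothetical example: E{X^2} = 1 when X ~ N(0, 1).
mu_hat, se = mc_estimate(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100_000, rng)
print(mu_hat, se)
```

An approximate 95% confidence interval for $\mu$ is then `mu_hat` plus or minus `1.96 * se`, by the central limit theorem argument above.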

Monte Carlo integration provides slow $O(n^{-1/2})$ convergence. With n nodes,
the quadrature methods described in Chapter 5 offer convergence of order $O(n^{-2})$ or
better. There are several reasons why Monte Carlo integration is nonetheless a very 
powerful tool. 

Most importantly, quadrature methods are difficult to extend to multidimensional
problems because general p-dimensional space is so vast. Straightforward
product rules creating quadrature grids of size $n^p$ quickly succumb to the curse of
dimensionality (discussed in Section 10.4.1), becoming harder to implement and
slower to converge. Monte Carlo integration samples randomly from f over the
p-dimensional support region of f, but does not attempt any systematic exploration
of this region. Thus, implementation of Monte Carlo integration is less hampered by
high dimensionality than is quadrature. However, when p is large, a very large sample
size may still be required to obtain an acceptable standard error for $\hat{\mu}_{MC}$. Quadrature
methods also perform best when h is smooth, even when p = 1. In contrast,
the Monte Carlo integration approach ignores smoothness. Further comparisons are 
offered in [190]. 

Monte Carlo integration replaces the systematic grid of quadrature nodes with a
set of points chosen randomly from a probability distribution. The first step, therefore, 
is to study how to generate such draws. This topic is addressed in Sections 6.2 and 
6.3. Methods for improving upon the standard estimator given in Equation (6.1) are 
described in Section 6.4. 


6.2 EXACT SIMULATION 


Monte Carlo integration motivates our focus on simulation of random variables that do 
not follow a familiar parametric distribution. We refer to the desired sampling density 
f as the target distribution. When the target distribution comes from a standard 
parametric family, abundant software exists to easily generate random deviates. At 
some level, all of this code relies on the generation of standard uniform random 
deviates. Given the deterministic nature of the computer, such draws are not really 
random, but a good generator will produce a sequence of values that are statistically 
indistinguishable from independent standard uniform variates. Generation of standard
uniform random deviates is a classic problem studied in [195, 227, 383, 538, 539, 557]. 

Rather than rehash the theory of uniform random number generation, we focus 
on the practical quandary faced by those with good software: what should be done 
when the target density is not one easily sampled using the software? For example, 
nearly all Bayesian posterior distributions are not members of standard parametric 
families. Posteriors obtained when using conjugate priors in exponential families are 
exceptions. 

There can be additional difficulties beyond the absence of an obvious method 
to sample f. In many cases—especially in Bayesian analyses—the target density 
may be known only up to a multiplicative proportionality constant. In such cases, f 
cannot be sampled and can only be evaluated up to that constant. Fortunately, there 
are a variety of simulation approaches that still work in this setting. 

Finally, it may be possible to evaluate f, but only at great computational expense. If each
computation of f(x) requires an optimization, an integration, or other time-consuming 
computations, we may seek simulation strategies that avoid direct evaluation of f as 
much as possible. 

Simulation methods can be categorized by whether they are exact or approxi- 
mate. The exact methods discussed in this section provide samples whose sampling 
distribution is exactly f. Later, in Section 6.3, we introduce methods producing sam- 
ples from a distribution that approximates f. 


6.2.1 Generating from Standard Parametric Families 


Before discussing sampling from difficult target distributions, we survey some strate- 
gies for producing random variates from familiar distributions using uniform random 
variates. We omit justifications for these approaches, which are given in the refer- 
ences cited above. Table 6.1 summarizes a variety of approaches. Although the tabled 
approaches are not necessarily state of the art, they illustrate some of the underlying 
principles exploited by sophisticated generators. 


6.2.2 Inverse Cumulative Distribution Function 


The methods for the Cauchy and exponential distributions in Table 6.1 are justified
by the inverse cumulative distribution function or probability integral transform
approach. For any continuous distribution function F, if U ~ Unif(0, 1), then
$X = F^{-1}(U) = \inf\{x : F(x) \ge U\}$ has a cumulative distribution function equal to F.

If $F^{-1}$ is available for the target density, then this strategy is probably the
simplest option. If $F^{-1}$ is not available but F is either available or easily approximated,
then a crude approach can be built upon linear interpolation. Using a grid of $x_1, \ldots, x_m$
spanning the region of support of f, calculate or approximate $u_i = F(x_i)$ at each grid
point. Then, draw U ~ Unif(0, 1) and linearly interpolate between the two nearest
grid points for which $u_i \le U \le u_j$ according to

$$X = \frac{u_j - U}{u_j - u_i}\, x_i + \frac{U - u_i}{u_j - u_i}\, x_j. \tag{6.3}$$

Although this approach is not exact, we include it in this section because the degree of
approximation is deterministic and can be reduced to any desired level by increasing 
m sufficiently. Compared to the alternatives, this simulation method is not appealing 
because it requires a complete approximation to F regardless of the desired sample 
size, it does not generalize to multiple dimensions, and it is less efficient than other 
approaches. 
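
Both variants of the inverse cumulative distribution function approach can be sketched briefly. The Python sketch below is illustrative, not a recommended generator: the function names, the Exponential(1) target, the grid on [0, 12], and the clamping of U to the range of the grid are all assumptions made for this example.

```python
import bisect
import math
import random

def exponential_draw(rate, rng):
    # F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -log(1 - u) / rate.
    return -math.log(1.0 - rng.random()) / rate

def interpolated_draw(grid, cdf_values, rng):
    """Approximate F^{-1}(U) by linear interpolation between grid points, as in (6.3)."""
    u = rng.random()
    # Clamp U to the range covered by the grid (a crude treatment of the tails).
    u = min(max(u, cdf_values[0]), cdf_values[-1])
    # Locate adjacent grid points with u_i <= U <= u_j.
    j = min(max(bisect.bisect_left(cdf_values, u), 1), len(grid) - 1)
    i = j - 1
    ui, uj = cdf_values[i], cdf_values[j]
    if uj == ui:
        return grid[i]
    return (uj - u) / (uj - ui) * grid[i] + (u - ui) / (uj - ui) * grid[j]

rng = random.Random(3)
m = 2001
grid = [12.0 * i / (m - 1) for i in range(m)]        # x_1, ..., x_m
cdf_values = [1.0 - math.exp(-xi) for xi in grid]     # u_i = F(x_i)
exact = [exponential_draw(1.0, rng) for _ in range(50_000)]
approx = [interpolated_draw(grid, cdf_values, rng) for _ in range(50_000)]
print(sum(exact) / len(exact), sum(approx) / len(approx))  # both should be near 1
```

Note that the interpolated version needs the full table of $u_i$ values before a single draw can be made, which is the cost mentioned above.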


6.2.3 Rejection Sampling 


If f(x) can be calculated, at least up to a proportionality constant, then we can use 
rejection sampling to obtain a random draw from exactly the target distribution. This 
strategy relies on sampling candidates from an easier distribution and then correcting 
the sampling probability through random rejection of some candidates. 

Let g denote another density from which we know how to sample and for which
we can easily calculate g(x). Let e(·) denote an envelope, having the property
$e(x) = g(x)/\alpha \ge f(x)$ for all x for which f(x) > 0 for a given constant $\alpha \le 1$. Rejection
sampling proceeds as follows:


1. Sample Y ~ g. 

2. Sample U ~ Unif(0, 1). 

3. Reject Y if U > f(Y)/e(Y). In this case, do not record the value of Y as an 
element in the target random sample. Instead, return to step 1. 


4. Otherwise, keep the value of Y. Set X = Y, and consider X to be an element
of the target random sample. Return to step 1 until you have accumulated a
sample of the desired size.


The draws kept using this algorithm constitute an i.i.d. sample from the target density
f; there is no approximation involved. To see this, note that the probability that a kept
draw falls at or below a value y is

$$
\begin{aligned}
P[X \le y] &= P\left[ Y \le y \,\middle|\, U \le \frac{f(Y)}{e(Y)} \right] \\
&= P\left[ Y \le y \text{ and } U \le \frac{f(Y)}{e(Y)} \right] \Big/ P\left[ U \le \frac{f(Y)}{e(Y)} \right] \\
&= \int_{-\infty}^{y} \int_{0}^{f(z)/e(z)} du\, g(z)\, dz \Big/ \int_{-\infty}^{\infty} \int_{0}^{f(z)/e(z)} du\, g(z)\, dz && \text{(6.4)} \\
&= \int_{-\infty}^{y} f(z)\, dz, && \text{(6.5)}
\end{aligned}
$$

which is the desired probability. Thus, the sampling distribution is exact, and $\alpha$ can
be interpreted as the expected proportion of candidates that are accepted. Hence $\alpha$ is
a measure of the efficiency of the algorithm. We may continue the rejection sampling
procedure until it yields exactly the desired number of sampled points, but this requires
a random total number of iterations that will depend on the proportion of rejections.
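
The four steps of the algorithm translate almost directly into code. The Python sketch below is a minimal illustration under assumed choices: the function name `rejection_sample` is hypothetical, and the target $f(x) = 6x(1-x)$ on (0, 1) with a Unif(0, 1) candidate density and $\alpha = 2/3$ is our own toy example, not one from the text.

```python
import random

def rejection_sample(f, e, draw_g, n, rng):
    """Return n kept draws and the total number of candidates tried."""
    kept, tried = [], 0
    while len(kept) < n:
        y = draw_g(rng)           # step 1: candidate Y ~ g
        u = rng.random()          # step 2: U ~ Unif(0, 1)
        tried += 1
        if u <= f(y) / e(y):      # steps 3 and 4: reject only if U > f(Y)/e(Y)
            kept.append(y)
    return kept, tried

# Hypothetical target: f(x) = 6x(1 - x) on (0, 1), whose maximum is 1.5.
# With g = Unif(0, 1) and alpha = 2/3, the envelope e(x) = g(x)/alpha = 1.5 bounds f.
rng = random.Random(2)
draws, tried = rejection_sample(lambda x: 6 * x * (1 - x), lambda x: 1.5,
                                lambda r: r.random(), 20_000, rng)
print(len(draws) / tried)  # observed acceptance proportion, to compare with alpha
```

The number of candidates tried is random, as noted above; its expected value is the desired sample size divided by $\alpha$.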




FIGURE 6.1 Illustration of rejection sampling for a target distribution f using a rejection
sampling envelope e. (Regions are labeled Keep and Reject; the horizontal axis is y.)


Recall the rejection rule in step 3 for determining the fate of a candidate draw, 
Y = y. Sampling U ~ Unif(0, 1) and obeying this rule is equivalent to sampling 
U|y ~ Unif(0, e(y)) and keeping the value y if U < f(y). Consider Figure 6.1. Suppose
the value y falls at the point indicated by the vertical bar. Then imagine sampling
U|Y = y uniformly along the vertical bar. The rejection rule eliminates this Y draw 
with probability proportional to the length of the bar above f(y) relative to the overall 
bar length. Therefore, one can view rejection sampling as sampling uniformly from the 
two-dimensional region under the curve e and then throwing away any draws falling 
above f and below e. Since sampling from f is equivalent to sampling uniformly 
from the two-dimensional region under the curve labeled f(x) and then ignoring the 
vertical coordinate, rejection sampling provides draws exactly from f. 

The shaded region in Figure 6.1 above f and below e indicates the waste. The 
draw Y = y is very likely to be rejected when e(y) is far larger than f(y). Envelopes 
that exceed f everywhere by at most a slim margin produce fewer wasted (i.e.,
rejected) draws and correspond to $\alpha$ values near 1.

Suppose now that the target distribution f is only known up to a proportionality
constant c. That is, suppose we are only able to compute easily q(x) = f(x)/c, where
c is unknown. Such densities arise, for example, in Bayesian inference when f is a
posterior distribution known to equal the product of the prior and the likelihood scaled
by some normalizing constant. Fortunately, rejection sampling can be applied in such
cases. We find an envelope e such that $e(x) \ge q(x)$ for all x for which q(x) > 0. A
draw Y = y is rejected when U > q(y)/e(y). The sampling probability remains correct
because the unknown constant c cancels out in the numerator and denominator of (6.4)
when f is replaced by q. The proportion of kept draws is $\alpha/c$.

Multivariate targets can also be sampled using rejection sampling, provided that 
a suitable multivariate envelope can be constructed. The rejection sampling algorithm 
is conceptually unchanged. 
To produce an envelope we must know enough about the target to bound it.
This may require optimization or a clever approximation to f or q in order to ensure
that e can be constructed to exceed the target everywhere. Note that when the target is
continuous and log-concave, it is unimodal. If we select $x_1$ and $x_2$ on opposite sides
of that mode, then the function obtained by connecting the line segments that are
tangent to log f or log q at $x_1$ and $x_2$ yields a piecewise exponential envelope with
exponential tails. Deriving this envelope does not require knowing the maximum of
the target density; it merely requires checking that $x_1$ and $x_2$ lie on opposite sides of
it. The adaptive rejection sampling method described in Section 6.2.3.2 exploits this
idea to generate good envelopes.

To summarize, good rejection sampling envelopes have three properties: They 
are easily constructed or confirmed to exceed the target everywhere, they are easy to 
sample, and they generate few rejected draws. 


Example 6.1 (Gamma Deviates) Consider the problem of generating a
Gamma(r, 1) random variable when $r \ge 1$. When Y is generated according to the
density

$$f(y) = \frac{t(y)^{r-1}\, t'(y) \exp\{-t(y)\}}{\Gamma(r)} \tag{6.6}$$

for $t(y) = a(1 + by)^3$ for $-1/b < y < \infty$, $a = r - \tfrac{1}{3}$, and $b = 1/\sqrt{9a}$, then X = t(Y)
will have a Gamma(r, 1) distribution [443]. Marsaglia and Tsang describe how to use
this fact in a rejection sampling framework [444]. Adopt (6.6) as the target distribution
because transforming draws from f gives the desired gamma draws.

Simplifying f and ignoring the normalizing constant, we wish to generate from
the density that is proportional to $q(y) = \exp\{a \log\{t(y)/a\} - t(y) + a\}$. Conveniently,
q fits snugly under the function $e(y) = \exp\{-y^2/2\}$, which is the unscaled standard
normal density. Therefore, rejection sampling amounts to sampling a standard
normal random variable, Z, and a standard uniform random variable, U, then setting
X = t(Z) if

$$U \le \frac{q(Z)}{e(Z)} = \exp\left\{ \frac{Z^2}{2} + a \log\left\{ \frac{t(Z)}{a} \right\} - t(Z) + a \right\} \tag{6.7}$$

and t(Z) > 0. Otherwise, the draw is rejected and the process begun anew. An accepted
draw has density Gamma(r, 1). Draws from Gamma(r, 1) can be rescaled to obtain
draws from Gamma(r, $\lambda$).

In a simulation when r = 4, over 99% of candidate draws are accepted and a
plot of e(y) and q(y) against y shows that the two curves are nearly superimposed.
Even in the worst case (r = 1), the envelope is excellent, with less than 5% waste.
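
The rejection scheme of Example 6.1 can be sketched as follows. This is a minimal Python illustration, assuming the cubic transformation $t(y) = a(1+by)^3$ with $a = r - 1/3$ and $b = 1/\sqrt{9a}$; the function name `gamma_draw` and the choice to compare on the log scale are ours, and production generators typically add further shortcuts that are omitted here.

```python
import math
import random

def gamma_draw(r, rng):
    """Draw from Gamma(r, 1) for r >= 1; return the draw and the number of candidates used."""
    a = r - 1.0 / 3.0
    b = 1.0 / math.sqrt(9.0 * a)
    tried = 0
    while True:
        z = rng.gauss(0.0, 1.0)    # candidate from the standard normal envelope
        u = rng.random()
        tried += 1
        t = a * (1.0 + b * z) ** 3
        if t <= 0.0:
            continue               # t(Z) must be positive
        # Acceptance test on the log scale: log U <= Z^2/2 + a log(t/a) - t + a.
        if u == 0.0 or math.log(u) <= 0.5 * z * z + a * math.log(t / a) - t + a:
            return t, tried

rng = random.Random(4)
results = [gamma_draw(4.0, rng) for _ in range(20_000)]
draws = [t for t, _ in results]
acceptance = len(results) / sum(k for _, k in results)
print(sum(draws) / len(draws), acceptance)  # mean should be near r = 4
```

To obtain Gamma(r, $\lambda$) draws, the output would be rescaled according to whichever parameterization (rate or scale) is in use.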


FIGURE 6.2 Unnormalized target (dotted) and envelope (solid) for rejection sampling in
Example 6.2.

Example 6.2 (Sampling a Bayesian Posterior) Suppose 10 independent observations
(8, 3, 4, 3, 1, 7, 2, 6, 2, 7) are observed from the model $X_i | \lambda \sim \text{Poisson}(\lambda)$.
A lognormal prior distribution for $\lambda$ is assumed: $\log \lambda \sim N(\log 4, 0.5^2)$. Denote the
likelihood as $L(\lambda|x)$ and the prior as $f(\lambda)$. We know that $\hat{\lambda} = \bar{x} = 4.3$ maximizes
$L(\lambda|x)$ with respect to $\lambda$; therefore the unnormalized posterior, $q(\lambda|x) = f(\lambda)L(\lambda|x)$,
is bounded above by $e(\lambda) = f(\lambda)L(4.3|x)$. Figure 6.2 shows q and e. Note that the
prior is proportional to e. Thus, rejection sampling begins by sampling $\lambda_i$ from
the lognormal prior and $U_i$ from a standard uniform distribution. Then $\lambda_i$ is kept
if $U_i < q(\lambda_i|x)/e(\lambda_i) = L(\lambda_i|x)/L(4.3|x)$. Otherwise, $\lambda_i$ is rejected and the process
is begun anew. Any kept $\lambda_i$ is a draw from the posterior. Although not efficient (only
about 30% of candidate draws are kept), this approach is easy and exact.
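
The sampler of Example 6.2 is short enough to write out in full. The Python sketch below follows the example's data, prior, and envelope; the function names and the use of a log-likelihood ratio (in which the factorial terms cancel) are our own implementation choices.

```python
import math
import random

x = [8, 3, 4, 3, 1, 7, 2, 6, 2, 7]
n_obs, total = len(x), sum(x)
lam_hat = total / n_obs            # 4.3, the maximizer of the Poisson likelihood

def log_lik_ratio(lam):
    # log{ L(lam | x) / L(lam_hat | x) }; the factorial terms cancel in the ratio.
    return total * math.log(lam / lam_hat) - n_obs * (lam - lam_hat)

def posterior_draws(n, rng):
    kept, tried = [], 0
    while len(kept) < n:
        # Candidate from the lognormal prior: log(lambda) ~ N(log 4, 0.5^2).
        lam = math.exp(rng.gauss(math.log(4.0), 0.5))
        tried += 1
        if rng.random() <= math.exp(log_lik_ratio(lam)):
            kept.append(lam)
    return kept, tried

rng = random.Random(6)
draws, tried = posterior_draws(10_000, rng)
print(len(draws) / tried)          # proportion of candidates kept
```

Because the envelope is the prior scaled by a constant, neither the prior density nor the normalizing constant of the posterior ever needs to be evaluated.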


6.2.3.1 Squeezed Rejection Sampling Ordinary rejection sampling requires 
one evaluation of f for every candidate draw Y. In cases where evaluating f is 
computationally expensive but rejection sampling is otherwise appealing, improved 
simulation speed is achieved by squeezed rejection sampling [383, 441, 442]. 

This strategy preempts the evaluation of f in some instances by employing a 
nonnegative squeezing function, s. For s to be a suitable squeezing function, s(x) must 
not exceed f(x) anywhere on the support of f. An envelope, e, is also used; as with
ordinary rejection sampling, $e(x) = g(x)/\alpha \ge f(x)$ on the support of f.

The algorithm proceeds as follows: 


1. Sample Y ~ g. 

2. Sample U ~ Unif(0, 1). 

3. If U < s(Y)/e(Y), keep the value of Y. Set X = Y and consider X to be an 
element in the target random sample. Then go to step 6. 


4. Otherwise, determine whether U < f(Y)/e(Y). If this inequality holds, keep 
the value of Y, setting X = Y. Consider X to be an element in the target random 
sample; then go to step 6. 


5. If Y has not yet been kept, reject it as an element in the target random sample. 


6. Return to step 1 until you have accumulated a sample of the desired size. 
FIGURE 6.3 Illustration of squeezed rejection sampling for a target distribution, f, using
envelope e and squeezing function s. Keep First and Keep Later correspond to steps 3 and 4 of
the algorithm, respectively.


Note that when Y = y, this candidate draw is kept with overall probability f(y)/e(y), 
and rejected with probability [e(y) — f(y)]/e(y). These are the same probabilities as 
with simple rejection sampling. Step 3 allows a decision to keep Y to be made on 
the basis of an evaluation of s, rather than of f. When s nestles up just underneath f 
everywhere, we achieve the largest decrease in the number of evaluations of f. 

Figure 6.3 illustrates the procedure. When a candidate Y = y is sampled, the 
algorithm proceeds in a manner equivalent to sampling a Unif(0, e(y)) random variable.
If this uniform variate falls below s(y), the candidate is kept immediately. The
lighter shaded region indicates where candidates are immediately kept. If the can- 
didate is not immediately kept, then a second test must be employed to determine 
whether the uniform variate falls under f(y) or not. Finally, the darker shaded region 
indicates where candidates are ultimately rejected. 

As with rejection sampling, the proportion of candidate draws kept is $\alpha$. The
proportion of iterations in which evaluation of f is avoided is $\int s(x)\, dx \big/ \int e(x)\, dx$.

Squeezed rejection sampling can also be carried out when the target is known 
only up to a proportionality constant. In this case, the envelope and squeezing function 
sandwich the unnormalized target. The method is still exact, and the same efficiency 
considerations apply. 

Generalizations for sampling multivariate targets are straightforward. 
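
Squeezed rejection sampling can be sketched with an unnormalized target. In the Python illustration below every specific choice is an assumption made for the example: the target $q(x) = \exp\{-x^2/2\}$, the envelope $e(x) = \exp\{1/2 - |x|\}$ (valid because $(|x| - 1)^2 \ge 0$), and the squeezing function $s(x) = \max\{0, 1 - x^2/2\}$ (valid because $\exp\{-u\} \ge 1 - u$). Candidates come from the double exponential density proportional to e.

```python
import math
import random

def q(x):   # unnormalized standard normal target
    return math.exp(-0.5 * x * x)

def e(x):   # envelope, proportional to the double exponential density
    return math.exp(0.5 - abs(x))

def s(x):   # squeezing function, nonnegative and never above q
    return max(0.0, 1.0 - 0.5 * x * x)

def draw_candidate(rng):
    # g(x) = exp(-|x|)/2: an exponential magnitude with a random sign.
    magnitude = -math.log(1.0 - rng.random())
    return magnitude if rng.random() < 0.5 else -magnitude

def squeezed_rejection(n, rng):
    kept, tried, target_evals = [], 0, 0
    while len(kept) < n:
        y = draw_candidate(rng)
        u = rng.random()
        tried += 1
        if u <= s(y) / e(y):      # step 3: keep without evaluating the target
            kept.append(y)
            continue
        target_evals += 1
        if u <= q(y) / e(y):      # step 4: second test, using the target
            kept.append(y)
        # step 5: otherwise the candidate is rejected
    return kept, tried, target_evals

rng = random.Random(8)
draws, tried, target_evals = squeezed_rejection(20_000, rng)
print(len(draws) / tried, 1.0 - target_evals / tried)
```

For these choices the proportion kept should be near $\sqrt{2\pi}/(2e^{1/2}) \approx 0.76$, and the proportion of iterations that avoid evaluating the target should be near $\int s / \int e \approx 0.57$.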


6.2.3.2 Adaptive Rejection Sampling Clearly the most challenging aspect of 
the rejection sampling strategy is the construction of a suitable envelope. Gilks and 
Wild proposed an automatic envelope generation strategy for squeezed rejection sam- 
pling for a continuous, differentiable, log-concave density on a connected region of 
support [244]. 
FIGURE 6.4 Piecewise linear outer and inner hulls for $\ell(x) = \log f(x)$ used in adaptive
rejection sampling when k = 5.


The approach is termed adaptive rejection sampling because the envelope and 
squeezing function are iteratively refined concurrently with the generation of sample 
draws. The amount of waste and the frequency with which f must be evaluated both 
shrink as iterations increase. 

Let $\ell(x) = \log f(x)$, and assume f(x) > 0 on a (possibly infinite) interval of
the real line. Let f be log-concave in the sense that $\ell(a) - 2\ell(b) + \ell(c) < 0$ for
any three points in the support region of f for which a < b < c. Under the additional
assumptions that f is continuous and differentiable, note that $\ell'(x)$ exists and decreases
monotonically with increasing x, but may have discontinuities.

The algorithm is initiated by evaluating $\ell$ and $\ell'$ at k points, $x_1 < x_2 < \cdots < x_k$.
Let $T_k = \{x_1, \ldots, x_k\}$. If the support of f extends to $-\infty$, choose $x_1$ such that
$\ell'(x_1) > 0$. Similarly, if the support of f extends to $\infty$, choose $x_k$ such that $\ell'(x_k) < 0$.

Define the rejection envelope on $T_k$ to be the exponential of the piecewise linear
upper hull of $\ell$ formed by the tangents to $\ell$ at each point in $T_k$. If we denote the upper
hull of $\ell$ as $e_k^*$, then the rejection envelope is $e_k(x) = \exp\{e_k^*(x)\}$. To understand the
concept of an upper hull, consider Figure 6.4. This figure shows $\ell$ with a solid line
and illustrates the case when k = 5. The dashed line shows the piecewise upper hull,
$e_k^*$. It is tangent to $\ell$ at each $x_i$, and the concavity of $\ell$ ensures that $e_k^*$ lies completely
above $\ell$ everywhere else. One can show that the tangents at $x_i$ and $x_{i+1}$ intersect at

$$z_i = \frac{\ell(x_{i+1}) - \ell(x_i) - x_{i+1}\,\ell'(x_{i+1}) + x_i\,\ell'(x_i)}{\ell'(x_i) - \ell'(x_{i+1})} \tag{6.8}$$


for i = 1, ..., k − 1. Therefore,

$$e_k^*(x) = \ell(x_i) + (x - x_i)\,\ell'(x_i) \quad \text{for } x \in [z_{i-1}, z_i] \tag{6.9}$$

and i = 1, ..., k, with $z_0$ and $z_k$ defined, respectively, to equal the (possibly infinite)
lower and upper bounds of the support region for f. Figure 6.5 shows the envelope
$e_k$ exponentiated to the original scale.

FIGURE 6.5 Envelopes and squeezing function for adaptive rejection sampling. The target
density is the smooth, nearly bell-shaped curve. The first method discussed in the text, using
the derivative of $\ell$, produces the envelope shown as the upper boundary of the lighter shaded
region. This corresponds to Equation (6.9) and Figure 6.4. Later in the text, a derivative-free
method is presented. That envelope is the upper bound of the darker shaded region and
corresponds to (6.11) and Figure 6.6. The squeezing function for both approaches is given by the
dotted curve.

Define the squeezing function on $T_k$ to be the exponential of the piecewise linear
lower hull of $\ell$ formed by the chords between adjacent points in $T_k$. This lower hull
is given by

$$s_k^*(x) = \frac{(x_{i+1} - x)\,\ell(x_i) + (x - x_i)\,\ell(x_{i+1})}{x_{i+1} - x_i} \quad \text{for } x \in [x_i, x_{i+1}] \tag{6.10}$$

and i = 1, ..., k − 1. When $x < x_1$ or $x > x_k$, let $s_k^*(x) = -\infty$. Thus the squeezing
function is $s_k(x) = \exp\{s_k^*(x)\}$. Figure 6.4 shows a piecewise linear lower hull, $s_k^*(x)$,
when k = 5. Figure 6.5 shows the squeezing function $s_k$ on the original scale.
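
The tangent intersections in (6.8) and the two hulls in (6.9) and (6.10) are straightforward to compute. The Python sketch below only builds and evaluates the hulls for a fixed grid; it is not a full adaptive sampler. The function names, the five grid points, and the target $\ell(x) = -x^2/2$ are assumptions chosen for illustration.

```python
import bisect
import math

def tangent_intersections(x, l, dl):
    """Points z_1, ..., z_{k-1} where adjacent tangents meet, as in (6.8)."""
    return [(l[i + 1] - l[i] - x[i + 1] * dl[i + 1] + x[i] * dl[i]) / (dl[i] - dl[i + 1])
            for i in range(len(x) - 1)]

def upper_hull(t, x, l, dl, z):
    """Evaluate the upper hull at t, as in (6.9): the tangent at x_i applies on [z_{i-1}, z_i]."""
    i = bisect.bisect_left(z, t)
    return l[i] + (t - x[i]) * dl[i]

def lower_hull(t, x, l):
    """Evaluate the lower hull at t, as in (6.10): chords inside the grid, -infinity outside."""
    if t < x[0] or t > x[-1]:
        return -math.inf
    i = min(bisect.bisect_right(x, t) - 1, len(x) - 2)
    return ((x[i + 1] - t) * l[i] + (t - x[i]) * l[i + 1]) / (x[i + 1] - x[i])

# Hypothetical log-concave target: l(x) = -x^2/2, with k = 5 grid points.
x = [-2.0, -0.5, 0.3, 1.0, 2.5]
l = [-0.5 * xi * xi for xi in x]     # l(x_i)
dl = [-xi for xi in x]               # l'(x_i)
z = tangent_intersections(x, l, dl)
print(z)
```

For this quadratic $\ell$, each intersection $z_i$ falls at the midpoint of $x_i$ and $x_{i+1}$, which provides a simple check on (6.8). Exponentiating the two hulls gives the envelope and squeezing function on the original scale.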

Figures 6.4 and 6.5 illustrate several important features of the approach. Both 
the rejection envelope and the squeezing function are piecewise exponential functions. 
The envelope has exponential tails that lie above the tails of f. The squeezing function 
has bounded support. 

Adaptive rejection sampling is initialized by choosing a modest k and a corresponding
suitable grid $T_k$. The first iteration of the algorithm proceeds as for squeezed
rejection sampling, using $e_k$ and $s_k$ as the envelope and squeezing function, respectively.
When a candidate draw is accepted, it may be accepted without evaluating $\ell$
and $\ell'$ at the candidate if the squeezing criterion was met. However, it may also be
accepted at the second stage, where evaluation of $\ell$ and $\ell'$ at the candidate is required.
When a candidate is accepted at this second stage, the accepted point is added to
the set $T_k$, creating $T_{k+1}$. Updated functions $e_{k+1}$ and $s_{k+1}$ are also calculated. Then
iterations continue. When a candidate draw is rejected, no update to $T_k$, $e_k$, or $s_k$ is
made. Further, we see now that a new point that matches any existing member of $T_k$
provides no meaningful update to $T_k$, $e_k$, or $s_k$.

Candidate draws are taken from the density obtained by scaling the piecewise
exponential envelope $e_k$ so that it integrates to 1. Since each accepted draw is made
using a rejection sampling approach, the draws are an i.i.d. sample precisely from f.
If f is known only up to a multiplicative constant, the adaptive rejection sampling
approach may still be used, since the proportionality constant merely shifts $\ell$, $e_k^*$,
and $s_k^*$.

Gilks and co-authors have developed a similar approach that does not require
evaluation of $\ell'$ [237, 240]. We retain the assumptions that f is log-concave with a
connected support region, along with the basic notation and setup for the
tangent-based approach above.

For the set of points $T_k$, define $L_i(\cdot)$ to be the straight line function connecting
$(x_i, \ell(x_i))$ and $(x_{i+1}, \ell(x_{i+1}))$ for i = 1, ..., k − 1. Define

$$e_k^*(x) = \begin{cases} \min\{L_{i-1}(x), L_{i+1}(x)\} & \text{for } x \in [x_i, x_{i+1}], \\ L_1(x) & \text{for } x < x_1, \\ L_{k-1}(x) & \text{for } x > x_k, \end{cases} \tag{6.11}$$

with the convention that $L_0(x) = L_k(x) = \infty$. Then $e_k^*$ is a piecewise linear upper
hull for $\ell$ because the concavity of $\ell$ ensures that $L_i(x)$ lies below $\ell(x)$ on $(x_i, x_{i+1})$
and above $\ell(x)$ when $x < x_i$ or $x > x_{i+1}$. The rejection sampling envelope is then
$e_k(x) = \exp\{e_k^*(x)\}$.
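
The derivative-free upper hull in (6.11) needs only the chords between adjacent grid points. The Python sketch below evaluates it for a fixed grid; the function names, the 0-based indexing, the grid, and the target $\ell(x) = -x^2/2$ are assumptions made for this illustration.

```python
import bisect
import math

def chord(j, t, x, l):
    """Line through (x_j, l(x_j)) and (x_{j+1}, l(x_{j+1})), with 0-based j.
    Indices with no chord return infinity, mirroring the convention L_0 = L_k = infinity."""
    if j < 0 or j > len(x) - 2:
        return math.inf
    slope = (l[j + 1] - l[j]) / (x[j + 1] - x[j])
    return l[j] + slope * (t - x[j])

def upper_hull_derivative_free(t, x, l):
    """Evaluate the derivative-free upper hull at t, as in (6.11)."""
    k = len(x)
    if t < x[0]:
        return chord(0, t, x, l)          # extend the first chord to the left
    if t > x[-1]:
        return chord(k - 2, t, x, l)      # extend the last chord to the right
    i = min(bisect.bisect_right(x, t) - 1, k - 2)   # t lies in [x_i, x_{i+1}]
    # Inside the grid, use the lower of the two neighboring chords, extended.
    return min(chord(i - 1, t, x, l), chord(i + 1, t, x, l))

# Hypothetical log-concave target: l(x) = -x^2/2, with k = 5 grid points.
x = [-2.0, -0.5, 0.3, 1.0, 2.5]
l = [-0.5 * xi * xi for xi in x]
print([round(upper_hull_derivative_free(t, x, l), 3) for t in (-3.0, 0.0, 3.0)])
```

Because extended chords lie farther above a concave $\ell$ than tangents do, this hull is looser than the tangent-based one built on the same grid, at the benefit of never evaluating $\ell'$.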

The squeezing function remains as in (6.10). Iterations of the derivative-free
adaptive rejection sampling algorithm proceed analogously to the previous approach,
with $T_k$, the envelope, and the squeezing function updated each time a new point is
kept.

Figure 6.6 illustrates the derivative-free adaptive rejection sampling algorithm
for the same target shown in Figure 6.4. The envelope is not as efficient as when $\ell'$ is
used. Figure 6.5 shows the envelope on the original scale. The lost efficiency is seen
on this scale, too.

Regardless of the method used to construct $e_k$, notice that one would prefer
the $T_k$ grid to be most dense in regions where f(x) is largest, near the mode of f.
Fortunately, this will happen automatically, since such points are most likely to be
kept in subsequent iterations and included in updates to $T_k$. Grid points too far in the
tails of f, such as $x_5$, are not very helpful.

Software for the tangent-based approach is available in [238]. The derivative- 
free approach has been popularized by its use in the WinBUGS software for carrying 
out Markov chain Monte Carlo algorithms to facilitate Bayesian analyses [241, 243, 
610]. Adaptive rejection sampling can also be extended to densities that are not log- 
concave, for example, by applying Markov chain Monte Carlo methods like those in 
Chapter 7 to further correct the sampling probabilities. One strategy is given in [240]. 
FIGURE 6.6 Piecewise linear outer and inner hulls for $\ell(x) = \log f(x)$ used in derivative-free
adaptive rejection sampling when k = 5.


6.3 APPROXIMATE SIMULATION 


Although the methods described above have the appealing feature that they are exact, 
there are many cases when an approximate method is easier or perhaps the only 
feasible choice. Despite how it might sound, approximation is not a critical flaw
of these methods because the degree of approximation can be controlled by user- 
specified parameters in the algorithms. The simulation methods in this section are 
all based to some extent on the sampling importance resampling principle, which we 
discuss first. 


6.3.1 Sampling Importance Resampling Algorithm 


The sampling importance resampling (SIR) algorithm simulates realizations approx- 
imately from some target distribution. SIR is based upon the notion of importance 
sampling, discussed in detail in Section 6.4.1. Briefly, importance sampling proceeds 
by drawing a sample from an importance sampling function, g. Informally, we will 
call g an envelope. Each point in the sample is weighted to correct the sampling prob- 
abilities so that the weighted sample can be related to a target density f. For example, 
the weighted sample can be used to estimate expectations under f. 

Having graphed some univariate targets and envelopes in the early part of this 
chapter to illustrate basic concepts, we shift now to multivariate notation to emphasize 
the full generality of techniques. Thus, $X = (X_1, \ldots, X_p)$ denotes a random vector
with density f(x), and g(x) denotes the density corresponding to a multivariate
envelope for f.
For the target density f, the weights used to correct sampling probabilities are
called the standardized importance weights and are defined as

$$w(x_i) = \frac{f(x_i)/g(x_i)}{\sum_{j=1}^{m} f(x_j)/g(x_j)} \tag{6.12}$$

for a collection of values $x_1, \ldots, x_m$ drawn i.i.d. from an envelope g. Although not
necessary for general importance sampling, it is useful to standardize the weights as
in (6.12) so they sum to 1. When f = cq for some unknown proportionality constant
c, the unknown c cancels in the numerator and denominator of (6.12).

We may view importance sampling as approximating f by the discrete distribution
having mass $w(x_i)$ on each observed point $x_i$ for i = 1, ..., m. Rubin proposed
sampling from this distribution to provide an approximate sample from f [559, 560].
The SIR algorithm therefore proceeds as follows:

1. Sample candidates $Y_1, \ldots, Y_m$ i.i.d. from g.

2. Calculate the standardized importance weights, $w(Y_1), \ldots, w(Y_m)$.

3. Resample $X_1, \ldots, X_n$ from $Y_1, \ldots, Y_m$ with replacement with probabilities
$w(Y_1), \ldots, w(Y_m)$.


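A random variable X drawn with the SIR algorithm has distribution that converges
to f as $m \to \infty$. To see this, define $w^*(y) = f(y)/g(y)$, let $Y_1, \ldots, Y_m \sim$
i.i.d. g, and consider a set A. Then

$$P[X \in A \mid Y_1, \ldots, Y_m] = \sum_{i=1}^{m} 1_{\{Y_i \in A\}}\, w^*(Y_i) \Big/ \sum_{i=1}^{m} w^*(Y_i). \tag{6.13}$$

The strong law of large numbers gives

$$\frac{1}{m} \sum_{i=1}^{m} 1_{\{Y_i \in A\}}\, w^*(Y_i) \to E\left\{ 1_{\{Y_i \in A\}}\, w^*(Y_i) \right\} = \int_A w^*(y)\, g(y)\, dy \tag{6.14}$$

as $m \to \infty$. Further,

$$\frac{1}{m} \sum_{i=1}^{m} w^*(Y_i) \to E\{w^*(Y_i)\} = 1 \tag{6.15}$$

as $m \to \infty$. Hence,

$$P[X \in A \mid Y_1, \ldots, Y_m] \to \int_A w^*(y)\, g(y)\, dy = \int_A f(y)\, dy \tag{6.16}$$

as $m \to \infty$. Finally, we note that

$$P[X \in A] = E\{P[X \in A \mid Y_1, \ldots, Y_m]\} \to \int_A f(y)\, dy \tag{6.17}$$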


by Lebesgue’s dominated convergence theorem [49, 595]. The proof is similar when 
the target and envelope are known only up to a constant [555]. 
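
The three steps of the SIR algorithm can be sketched compactly. In the Python illustration below, the function name `sir`, the use of log densities, and the toy problem (a standard normal target known only up to a constant, with a Cauchy envelope) are all assumptions made for the example. Working with log densities and subtracting the largest log weight is a common numerical safeguard, not part of the algorithm as stated.

```python
import math
import random

def sir(log_f, log_g, draw_g, m, n, rng):
    """Return n approximate draws from f, using m candidates from g, plus the weights.
    log_f may omit an additive constant because the weights in (6.12) are standardized."""
    y = [draw_g(rng) for _ in range(m)]                  # step 1: candidates from g
    log_w = [log_f(v) - log_g(v) for v in y]
    shift = max(log_w)                                   # guard against overflow
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]                         # step 2: standardized weights
    return rng.choices(y, weights=w, k=n), w             # step 3: resample with replacement

rng = random.Random(5)
draws, w = sir(lambda y: -0.5 * y * y,                          # log target, up to a constant
               lambda y: -math.log(math.pi * (1.0 + y * y)),    # log Cauchy density
               lambda r: math.tan(math.pi * (r.random() - 0.5)),# Cauchy draw by inversion
               50_000, 2_000, rng)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(mean, var, max(w))
```

Inspecting the largest standardized weight, as printed here, is a quick way to spot an envelope that is poorly matched to the target.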
Although both SIR and rejection sampling rely on the ratio of target to envelope,
they differ in an important way. Rejection sampling is perfect, in the sense
that the distribution of a generated draw is exactly f, but it requires a random number
of draws to obtain a sample of size n. In contrast, the SIR algorithm uses a
pre-determined number of draws to generate an n-sample but permits a random
degree of approximation to f in the distribution of the sampled points.

When conducting SIR, it is important to consider the relative sizes of the initial
sample and the resample. These sample sizes are m and n, respectively. In principle,
we require $n/m \to 0$ for distributional convergence of the sample. In the context
of asymptotic analysis of Monte Carlo estimates based on SIR, where $n \to \infty$, this
condition means that $m \to \infty$ even faster than $n \to \infty$. For fixed n, distributional
convergence of the sample occurs as $m \to \infty$; therefore in practice one obviously
wants to initiate SIR with the largest possible m. However, one faces the competing
desire to choose n as large as possible to increase the inferential precision. The maximum
tolerable ratio n/m depends on the quality of the envelope. We have sometimes
found $n/m \le \tfrac{1}{10}$ tolerable so long as the resulting resample does not contain too many
replicates of any initial draw.

The SIR algorithm can be sensitive to the choice of g. First, the support of g 
must include the entire support of f if a reweighted sample from g is to approximate 
a sample from f. Further, g should have heavier tails than f, or more generally g 
should be chosen to ensure that f(x)/g(x) never grows too large. If g(x) is nearly 
zero anywhere where f(x) is positive, then a draw from this region will happen only 
extremely rarely, but when it does it will receive a huge weight. 

When this problem arises, the SIR algorithm exhibits the symptom that one or 
a few standardized importance weights are enormous compared to the other weights, 
and the secondary sample consists nearly entirely of replicated values of one or a 
few initial draws. When the problem is not too severe, taking the secondary resample 
without replacement has been suggested [220]. This is asymptotically equivalent to 
sampling with replacement, but has the practical advantage that it prevents exces- 
sive duplication. The disadvantage is that it introduces some additional distributional 
approximation in the final sample. When the distribution of weights is found to be 
highly skewed, it is probably wiser to switch to a different envelope or a different 
sampling strategy altogether. 

Since SIR generates $X_1, \ldots, X_n$ approximately i.i.d. from f, one may proceed
with Monte Carlo integration, such as estimating the expectation of h(X) by
$\hat{\mu}_{\mathrm{SIR}} = \sum_{i=1}^{n} h(X_i)/n$ as in (6.1). However, in Section 6.4 we will introduce superior ways
to use the initial weighted importance sample, along with other powerful methods to 
improve Monte Carlo estimation of integrals. 


Example 6.3 (Slash Distribution) The random variable Y has a slash distribution 
if Y = X/U where X ~ N(0, 1) and U ~ Unif(0, 1) independently. Consider using 
the slash distribution as a SIR envelope to generate standard normal variates, and 
conversely using the normal distribution as a SIR envelope to generate slash variates. 
Since it is easy to simulate from both densities using standard methods, SIR is not 
needed in either case, but examining the results is instructive.

FIGURE 6.7 The left panel shows a histogram of approximate draws from a standard normal
density obtained via SIR with a slash distribution envelope. The right panel shows a histogram 
of approximate draws from a slash density obtained via SIR using a normal envelope. The solid 
lines show the target densities. 


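The slash density function is

\[
f(y) =
\begin{cases}
\dfrac{1 - \exp\{-y^2/2\}}{y^2\sqrt{2\pi}}, & y \neq 0, \\[1ex]
\dfrac{1}{2\sqrt{2\pi}}, & y = 0.
\end{cases}
\]
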
This density has very heavy tails. Therefore, it is a fine importance sampling function 
for generating draws from a standard normal distribution using SIR. The left panel of 
Figure 6.7 illustrates the results when m = 100,000 and n = 5000. The true normal 
density is superimposed for comparison. 

On the other hand, the normal density is not a suitable importance sampling 
function for SIR use when generating draws from the slash distribution because the 
envelope’s tails are far lighter than the target’s. The right panel of Figure 6.7 (where, 
again, m = 100,000 and n = 5000) illustrates the problems that arise. Although the 
tails of the slash density assign appreciable probability as far as 10 units from the 
origin, no candidate draws from the normal density exceeded 5 units from the origin. 
Therefore, beyond these limits, the simulated tails of the target have been completely 
truncated. Further, the most extreme candidate draws generated have far less density 
under the normal envelope than they do under the slash target, so their importance 
ratios are extremely high. This leads to abundant resampling of these points in the 
tails. Indeed, 528 of the 5000 values selected by SIR are replicates of the three lowest 
unique values in the histogram. 
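
The contrast between the two envelopes in Example 6.3 can be explored with a short simulation. The Python sketch below is illustrative only: the function names, the random seed, and the reduced sample sizes (m = 20,000 and n = 1000) are our assumptions, not the settings used to produce Figure 6.7. It runs SIR in both directions and reports the largest standardized weight, which is the symptom of a poor envelope described earlier.

```python
import math
import random

SQRT_2PI = math.sqrt(2.0 * math.pi)

def normal_pdf(y):
    return math.exp(-0.5 * y * y) / SQRT_2PI

def slash_pdf(y):
    # Slash density; at y = 0 we use the limiting value 1 / (2 sqrt(2 pi)).
    y2 = y * y
    if y2 == 0.0:
        return 1.0 / (2.0 * SQRT_2PI)
    # -expm1(-a) equals 1 - exp(-a) but is accurate for small a.
    return -math.expm1(-0.5 * y2) / (y2 * SQRT_2PI)

def draw_slash(rng):
    # Y = X / U with X ~ N(0, 1) and U ~ Unif(0, 1), independent.
    u = rng.random()
    while u == 0.0:          # avoid dividing by zero
        u = rng.random()
    return rng.gauss(0.0, 1.0) / u

def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

def sir(target_pdf, envelope_pdf, envelope_draw, m, n, rng):
    """One SIR pass: returns the resample and the standardized weights."""
    candidates = [envelope_draw(rng) for _ in range(m)]
    raw = [target_pdf(y) / envelope_pdf(y) for y in candidates]
    total = sum(raw)
    weights = [w / total for w in raw]
    resample = rng.choices(candidates, weights=weights, k=n)
    return resample, weights

rng = random.Random(1)
m, n = 20000, 1000

# Heavy-tailed slash envelope, normal target: importance ratios are bounded.
normal_draws, w_good = sir(normal_pdf, slash_pdf, draw_slash, m, n, rng)

# Light-tailed normal envelope, slash target: a few ratios are very large.
slash_draws, w_bad = sir(slash_pdf, normal_pdf, draw_normal, m, n, rng)

print("max standardized weight, slash envelope :", max(w_good))
print("max standardized weight, normal envelope:", max(w_bad))
print("distinct values in slash resample       :", len(set(slash_draws)))
```

With the normal envelope, one would expect the largest weights to belong to the most extreme candidates, so the count of distinct resampled values is a quick check for the replication problem noted above.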


Example 6.4 (Bayesian Inference) Suppose that we seek a sample from the pos- 
terior distribution from a Bayesian analysis. Such a sample could be used to provide 
Monte Carlo estimates of posterior moments, probabilities, or highest posterior density
intervals, for example. Let $f(\theta)$ denote the prior, and $L(\theta|x)$ the likelihood, so
the posterior is $f(\theta|x) = c f(\theta) L(\theta|x)$ for some constant c that may be difficult to
determine. If the prior does not seriously restrict the parameter region favored by
the data via the likelihood function, then perhaps the prior can serve as a useful
importance sampling function. Sample $\theta_1, \ldots, \theta_m$ i.i.d. from $f(\theta)$. Since the target
density is the posterior, the ith unstandardized weight equals $L(\theta_i|x)$. Thus the SIR
algorithm has a very simple form: Sample from the prior, weight by the likelihood, 
and resample. 

For instance, recall Example 6.2. In this case, importance sampling could begin
by drawing $\lambda_1, \ldots, \lambda_m \sim$ i.i.d. Lognormal$(\log 4, 0.5^2)$. The importance weights
would be proportional to $L(\lambda_i|x)$. Resampling from $\lambda_1, \ldots, \lambda_m$ with replacement
with these weights yields an approximate sample from the posterior. 
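
The "sample from the prior, weight by the likelihood, resample" recipe can be sketched in a few lines. In the Python sketch below, the lognormal prior follows the text, but the Poisson likelihood and the data values are purely hypothetical stand-ins, since the details of Example 6.2 are not repeated here; the function names are ours as well.

```python
import math
import random

def poisson_loglik(lam, counts):
    # Log-likelihood of i.i.d. Poisson counts (hypothetical model),
    # omitting terms that do not depend on lam.
    return sum(counts) * math.log(lam) - len(counts) * lam

def sir_posterior(counts, m, n, rng):
    # 1. Sample from the prior, Lognormal(log 4, 0.5^2).
    prior_draws = [rng.lognormvariate(math.log(4.0), 0.5) for _ in range(m)]
    # 2. Weight by the likelihood, working on the log scale for stability.
    log_w = [poisson_loglik(lam, counts) for lam in prior_draws]
    shift = max(log_w)
    raw = [math.exp(lw - shift) for lw in log_w]
    total = sum(raw)
    weights = [w / total for w in raw]
    # 3. Resample with replacement using the standardized weights.
    return rng.choices(prior_draws, weights=weights, k=n)

rng = random.Random(0)
counts = [3, 5, 4, 6, 2, 4, 5, 3]      # hypothetical data, not from the text
posterior_draws = sir_posterior(counts, m=20000, n=2000, rng=rng)
post_mean = sum(posterior_draws) / len(posterior_draws)
print("approximate posterior mean of lambda:", post_mean)
```

Subtracting the largest log-likelihood before exponentiating does not change the standardized weights, because any common factor cancels when they are normalized.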


6.3.1.1 Adaptive Importance, Bridge, and Path Sampling In some circum- 
stances, one initially may be able to specify only a very poor importance sampling 
envelope. This may occur, for example, when the target density has support nearly 
limited to a lower-dimensional space or surface due to strong dependencies between 
variables not well understood by the analyst. In other situations, one may wish to 
conduct importance sampling for a variety of related problems, but no single enve- 
lope may be suitable for all the target densities of interest. In situations like this, it is 
possible to adapt the importance sampling envelope. 

One collection of ideas for envelope improvement is termed adaptive importance
sampling. An initial sample of size $m_1$ is taken from an initial envelope $e_1$.
This sample is weighted (and possibly resampled) to obtain an initial estimate of
quantities of interest or an initial view of f itself. Based on the information obtained,
the envelope is improved, yielding $e_2$. Further importance sampling and envelope
improvement steps are taken as needed. When such steps are terminated, it is most 
efficient to use the draws from all the steps, along with their weights, to formulate 
suitable inference. Alternatively, one can conduct quick envelope refinement during 
several initial steps, withholding the majority of simulation effort to the final stage 
and limiting inference to this final sample for simplicity. 

In parametric adaptive importance sampling, the envelope is typically assumed 
to belong to some family of densities indexed by a low-dimensional parameter. The 
best choice for the parameter is estimated at each iteration, and the importance sam- 
pling steps are iterated until estimates of this indexing parameter stabilize [189, 381, 
490, 491, 606]. In nonparametric adaptive importance sampling, the envelope is often 
assumed to be a mixture distribution, such as is generated with the kernel density 
estimation approach in Chapter 10. Importance sampling steps are alternated with 
envelope updating steps, adding, deleting, or modifying mixture components. Exam- 
ples include [252, 657, 658, 680]. Although potentially useful in some circumstances, 
these approaches are overshadowed by Markov chain Monte Carlo methods like those 
described in Chapter 7, because the latter are usually simpler and at least as effective. 
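
To make the parametric variant concrete, the following Python sketch adapts a normal envelope by matching its mean and spread to weighted moments of the current sample. Everything specific here is an assumption made for illustration: the target (a normal density known only up to a constant), the starting envelope, the number of stages, and the inflation factor used to keep the envelope's spread wider than the estimated target spread. Inference is limited to the final stage for simplicity.

```python
import math
import random

def target_unnorm(x):
    # Hypothetical target: N(5, 0.5^2), known only up to a constant.
    return math.exp(-0.5 * ((x - 5.0) / 0.5) ** 2)

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def adaptive_is(stages, m, rng, mean=0.0, sd=4.0, inflate=1.5):
    """Iteratively refit a normal envelope; returns final-stage estimates."""
    est_mean, est_sd = mean, sd
    for _ in range(stages):
        draws = [rng.gauss(mean, sd) for _ in range(m)]
        raw = [target_unnorm(x) / normal_pdf(x, mean, sd) for x in draws]
        total = sum(raw)
        w = [r / total for r in raw]
        est_mean = sum(wi * xi for wi, xi in zip(w, draws))
        est_var = sum(wi * (xi - est_mean) ** 2 for wi, xi in zip(w, draws))
        est_sd = math.sqrt(est_var)
        # Envelope update: recenter, and inflate the spread (our choice)
        # so the new envelope stays wider than the estimated target.
        mean, sd = est_mean, max(inflate * est_sd, 1e-3)
    return est_mean, est_sd

rng = random.Random(3)
est_mean, est_sd = adaptive_is(stages=5, m=2000, rng=rng)
print("estimated target mean and s.d.:", est_mean, est_sd)
```

Iteration could instead stop when the envelope parameters stabilize, as described above; a fixed number of stages is used here only to keep the sketch short.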

A second collection of ideas for envelope improvement is relevant when a single 
envelope is inadequate for the consideration of several densities. In Bayesian statistics 
and certain marginal likelihood and missing data problems, one is often interested 
in estimating a ratio of normalizing constants for a pair of densities. For example, if 
$f_i(\theta|x) = c_i q_i(\theta|x)$ is the ith posterior density for $\theta$ (for i = 1, 2) under two competing
models, where $q_i$ is known but $c_i$ is unknown, then $r = c_2/c_1$ is the posterior odds
for model 1 compared to model 2. The Bayes factor is the ratio of r to the prior odds.

Since it is often difficult to find good importance sampling envelopes for both $f_1$
and $f_2$, one standard importance sampling approach is to use only a single envelope
to estimate r. For example, in the convenient case when the support of $f_2$ contains that
of $f_1$ and we are able to use $f_2$ as the envelope, $r = E_{f_2}\{q_1(\theta|x)/q_2(\theta|x)\}$. However,
when $f_1$ and $f_2$ differ greatly, such a strategy will perform poorly because no single
envelope can be sufficiently informative about both $c_1$ and $c_2$. The strategy of bridge
sampling employs an unnormalized density, $q_{\mathrm{bridge}}$, that is, in some sense, between
$q_1$ and $q_2$ [457]. Then noting that

\[
r = \frac{E_{f_2}\{q_{\mathrm{bridge}}(\theta|x)/q_2(\theta|x)\}}{E_{f_1}\{q_{\mathrm{bridge}}(\theta|x)/q_1(\theta|x)\}}, \tag{6.18}
\]

we may employ importance sampling to estimate the numerator and the denominator,
thus halving the difficulty of each task, since $q_{\mathrm{bridge}}$ is nearer to each $q_i$ than the
$q_i$ are to each other. These ideas have been extensively studied for Bayesian model
choice [437]. 
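
A toy calculation shows how the ratio in (6.18) can be estimated in practice. In the Python sketch below, the two unnormalized densities are normal kernels whose normalizing constants are known, so the estimate can be checked against the true value r = c2/c1 = 1/S. The geometric-mean bridge, the values of MU and S, and the use of exact draws from f1 and f2 (rather than importance samples) are all assumptions made to keep the illustration short.

```python
import math
import random

MU, S = 3.0, 2.0   # hypothetical mean and s.d. of the second density

def q1(theta):
    # Unnormalized N(0, 1): c1 = 1 / sqrt(2 pi).
    return math.exp(-0.5 * theta * theta)

def q2(theta):
    # Unnormalized N(MU, S^2): c2 = 1 / (S sqrt(2 pi)).
    return math.exp(-0.5 * ((theta - MU) / S) ** 2)

def q_bridge(theta):
    # Geometric-mean bridge; one possible "in between" choice.
    return math.sqrt(q1(theta) * q2(theta))

def bridge_estimate(n, rng):
    draws_f1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    draws_f2 = [rng.gauss(MU, S) for _ in range(n)]
    # Numerator of (6.18): expectation under f2 of q_bridge / q2.
    numerator = sum(q_bridge(t) / q2(t) for t in draws_f2) / n
    # Denominator of (6.18): expectation under f1 of q_bridge / q1.
    denominator = sum(q_bridge(t) / q1(t) for t in draws_f1) / n
    return numerator / denominator

rng = random.Random(11)
r_hat = bridge_estimate(20000, rng)
r_true = 1.0 / S
print("bridge estimate of r:", r_hat, " true r:", r_true)
```

Both expectations equal the integral of the bridge density multiplied by the corresponding normalizing constant, so the unknown integral cancels in the ratio.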

In principle, the idea of bridging can be extended by iterating the strategy 
employed in (6.18) with a nested sequence of intermediate densities between $q_1$ and
$q_2$. Each neighboring pair of densities in the sequence between $q_1$ and $q_2$ would be
close enough to enable reliable estimation of the corresponding ratio of normalizing 
constants, and from those ratios one could estimate r. In practice, it turns out that the 
limit of such a strategy amounts to a very simple algorithm termed path sampling. 
Details are given in [222]. 


6.3.2 Sequential Monte Carlo 


When the target density f becomes high dimensional, SIR is increasingly inefficient 
and can be difficult to implement. Specifying a very good high-dimensional envelope 
that closely approximates the target with sufficiently heavy tails but little waste can 
be challenging. Sequential Monte Carlo methods address the problem by splitting the 
high-dimensional task into a sequence of simpler steps, each of which updates the 
previous one. 

Suppose that $X_{1:t} = (X_1, \ldots, X_t)$ represents a discrete-time stochastic process
with $X_t$ being the observation at time t and $X_{1:t}$ representing the entire history of
the sequence thus far. The $X_t$ may be multidimensional, but for simplicity we adopt
scalar notation here when possible. Write the density of $X_{1:t}$ as $f_t$. Suppose that we
wish to estimate at time t the expected value of $h(X_{1:t})$ with respect to $f_t$ using an
importance sampling strategy.

A direct application of the SIR approach from Section 6.3.1 would be to draw
a sample of $x_{1:t}$ sequences from an envelope $g_t$ and then calculate the importance
weighted average of this sample of $h(x_{1:t})$ values. However, this overlooks a key
aspect of the problem. As t increases, $X_{1:t}$ and the expected value of $h(X_{1:t})$ evolve.
At time t it would be better to update previous inferences than to act as if we had no
previous information. Indeed, it would be very inefficient to start the SIR approach
from scratch at each time t. Instead we will develop a strategy that will enable us
to append the simulated $X_t$ to the $X_{1:t-1}$ previously simulated and to adjust the
previous importance weights in order to estimate the expected value of $h(X_{1:t})$. Such
an approach is called sequential importance sampling [419].

The attractiveness of such incremental updating is particularly apparent when 
sequential estimation must be completed in real time. For example, when applied to 
object tracking, sequential importance sampling can be used to estimate the current 
location of an object (e.g., a missile) from noisy observations made by remote sensors 
(e.g., radar). At any time, the true location is estimated from the current observation 
and the sequence of past observations (i.e., the estimated past trajectory). As track- 
ing continues at each time increment, the estimation must keep up with the rapidly 
incoming data. A wide variety of sequential estimation techniques have been applied 
to tackle this tracking problem [95, 239, 287, 447]. 

Applications of sequential importance sampling are extremely diverse, spanning 
sciences from physics to molecular biology, and generic statistical problems from 
Bayesian inference to the analysis of sparse contingency tables [108, 109, 167, 419]. 


6.3.2.1 Sequential Importance Sampling for Markov Processes Let us
begin with the simplifying assumption that $X_{1:t}$ is a Markov process. In this case, $X_t$
depends only on $X_{t-1}$ rather than the whole history $X_{1:t-1}$. Then the target density
$f_t(x_{1:t})$ may be expressed as

\[
\begin{aligned}
f_t(x_{1:t}) &= f_1(x_1)\, f_2(x_2|x_{1:1})\, f_3(x_3|x_{1:2}) \cdots f_t(x_t|x_{1:t-1}) \\
&= f_1(x_1)\, f_2(x_2|x_1)\, f_3(x_3|x_2) \cdots f_t(x_t|x_{t-1}).
\end{aligned} \tag{6.19}
\]


Suppose that we adopt the same Markov form for the envelope, namely 


\[
g_t(x_{1:t}) = g_1(x_1)\, g_2(x_2|x_1)\, g_3(x_3|x_2) \cdots g_t(x_t|x_{t-1}). \tag{6.20}
\]


Using the ordinary nonsequential SIR algorithm (Section 6.3.1), at time t one
would sample from $g_t(x_{1:t})$ and reweight each $x_{1:t}$ value by $w_t = f_t(x_{1:t})/g_t(x_{1:t})$.
Using (6.19) and (6.20), we see that $w_t = u_1 u_2 \cdots u_t$, where $u_1 = f_1(x_1)/g_1(x_1)$ and
$u_i = f_i(x_i|x_{i-1})/g_i(x_i|x_{i-1})$ for $i = 2, \ldots, t$.

Given $x_{1:t-1}$ and $w_{t-1}$, we can take advantage of the Markov property by
sampling only the next component, namely $X_t$, appending the value to $x_{1:t-1}$, and
adjusting $w_{t-1}$ using the multiplicative factor $u_t$. Specifically, when the target and
envelope distributions are Markov, a sequential importance sampling approach for
obtaining at each time a sequence $x_{1:t}$ and corresponding weight $w_t$ is given in the
steps below. A sample of n such points and their weights can be used to approximate
$f_t(x_{1:t})$ and hence the expected value of $h(X_{1:t})$. The algorithm is:


1. Sample $X_1 \sim g_1$. Let $w_1 = u_1 = f_1(x_1)/g_1(x_1)$. Set t = 2.

2. Sample $X_t | x_{t-1} \sim g_t(x_t|x_{t-1})$.

3. Append $x_t$ to $x_{1:t-1}$, obtaining $x_{1:t}$.

4. Let $u_t = f_t(x_t|x_{t-1})/g_t(x_t|x_{t-1})$.

5. Let $w_t = w_{t-1} u_t$. At the current time, $w_t$ is the importance weight for $x_{1:t}$.

6. Increment t and return to step 2.

For obtaining an independent sample of n draws $x_{1:t}^{(i)}$, $i = 1, \ldots, n$, the above algorithm
can be carried out by treating the n sequences one at a time or as a batch. Using
this sample of size n, the weighted average $\sum_{i=1}^{n} w_t^{(i)} h(x_{1:t}^{(i)}) / \sum_{i=1}^{n} w_t^{(i)}$ serves as the
estimate of $E_{f_t} h(X_{1:t})$.

It is not necessary to standardize the weights at the end of each cycle above, although
when we seek to estimate $E_{f_t} h(X_{1:t})$ the normalization is natural. Section 6.4.1
discusses weight normalization in a more general context. Section 6.3.2.5 describes 
a generalization of sequential importance sampling that involves resampling with 
normalized weights between cycles. 


Example 6.5 (Simple Markov Process) Let $X_t | X_{t-1} \sim f_t$ denote a Markov process
where

\[
f_t(x_t|x_{t-1}) \propto \left|\cos\{x_t - x_{t-1}\}\right| \exp\left\{-\tfrac{1}{4}(x_t - x_{t-1})^2\right\}. \tag{6.21}
\]

For each t we wish to obtain an importance-weighted sample of $x_{1:t}^{(i)}$ with importance
weights $w_t^{(i)}$ for $i = 1, \ldots, n$. Suppose we wish to estimate $\sigma_t$, the standard deviation
of $X_t$, using the weighted estimator


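\[
\hat{\sigma}_t = \left[ \frac{1}{1 - \sum_{i=1}^{n} (w_t^{(i)})^2} \sum_{i=1}^{n} w_t^{(i)} \left(x_t^{(i)} - \hat{\mu}_t\right)^2 \right]^{1/2}, \tag{6.22}
\]

where $\hat{\mu}_t = \sum_{i=1}^{n} w_t^{(i)} x_t^{(i)} / \sum_{i=1}^{n} w_t^{(i)}$.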

Using sequential importance sampling, at stage t we can begin by sampling
from a normal envelope $x_t^{(i)} \sim N(x_{t-1}^{(i)}, 1.5^2)$. Due to the Markov property, the
weight updates are

\[
u_t^{(i)} \propto \frac{\left|\cos\{x_t^{(i)} - x_{t-1}^{(i)}\}\right| \exp\left\{-(x_t^{(i)} - x_{t-1}^{(i)})^2/4\right\}}{\phi(x_t^{(i)};\, x_{t-1}^{(i)},\, 1.5^2)},
\]

where $\phi(z; a, b)$ denotes the normal density for z with mean a and variance b.


Thus to update the (t − 1)th sample, the $x_t^{(i)}$ are drawn and appended to the corresponding
past sequences to form $x_{1:t}^{(i)}$, with the respective weights updated according
to $w_t^{(i)} = w_{t-1}^{(i)} u_t^{(i)}$. We may then estimate $\sigma_t$ using (6.22). At t = 100 we find
$\hat{\sigma}_t = 13.4$. By comparison, when $X_t | x_{t-1} \sim N(x_{t-1}, 2^2)$, then no sequential importance
sampling is needed and the analogous $\hat{\sigma}_t = 9.9$. Thus, the cosine term in (6.21)
contributes additional variance to the distribution $f_t$.

This example is sufficiently simple that other approaches are possible, but our 
sequential importance sampling approach is very straightforward. However, as t in- 
creases, the weights develop an undesirable degeneracy discussed in Section 6.3.2.3. 
Implementing the ideas there makes this example more effective. 
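
The six-step algorithm of Section 6.3.2.1, applied to the target (6.21) with the normal envelope of this example, can be sketched as follows. This Python code is a minimal illustration under stated assumptions: the starting value x0 = 0, the sample size, the short horizon t = 5, and the function names are ours. Only the most recent component of each sequence is stored because the estimator depends only on $x_t$; the standard deviation is computed with the correction factor $1/(1 - \sum w^2)$ applied to standardized weights, which we take to be the intent of (6.22).

```python
import math
import random

ENV_SD = 1.5   # envelope standard deviation used in the example

def target_unnorm(x_new, x_old):
    # Right-hand side of (6.21), up to a normalizing constant.
    d = x_new - x_old
    return abs(math.cos(d)) * math.exp(-0.25 * d * d)

def normal_pdf(z, mean, sd):
    s = (z - mean) / sd
    return math.exp(-0.5 * s * s) / (sd * math.sqrt(2.0 * math.pi))

def sis_markov(t_max, n, rng, x0=0.0):
    x = [x0] * n     # current component x_t of each sequence
    w = [1.0] * n    # unstandardized importance weights
    for _ in range(t_max):
        for i in range(n):
            x_new = rng.gauss(x[i], ENV_SD)              # sample from envelope
            u = target_unnorm(x_new, x[i]) / normal_pdf(x_new, x[i], ENV_SD)
            w[i] *= u                                    # w_t = w_{t-1} u_t
            x[i] = x_new                                 # append (keep last value)
    return x, w

def weighted_sd(x, w):
    total = sum(w)
    ws = [wi / total for wi in w]                        # standardized weights
    mu = sum(wi * xi for wi, xi in zip(ws, x))
    ss = sum(wi * (xi - mu) ** 2 for wi, xi in zip(ws, x))
    return math.sqrt(ss / (1.0 - sum(wi * wi for wi in ws)))

rng = random.Random(5)
x_t, w_t = sis_markov(t_max=5, n=10000, rng=rng)
sigma_hat = weighted_sd(x_t, w_t)
ess = sum(w_t) ** 2 / sum(wi * wi for wi in w_t)
print("estimated s.d. of X_5:", sigma_hat, " effective sample size:", ess)
```

Running the same code with a much larger t_max is a simple way to observe the weight degeneracy mentioned above: the printed effective sample size shrinks rapidly as t grows.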


6.3.2.2 General Sequential Importance Sampling The task of obtaining an
approximate sample from $f_t(x_{1:t})$ was greatly simplified in Section 6.3.2.1 because
of the Markov assumption. Suppose now that we want to achieve the same goal in a
case where the Markov property does not apply. Then the target density is

\[
f_t(x_{1:t}) = f_1(x_1)\, f_2(x_2|x_{1:1})\, f_3(x_3|x_{1:2}) \cdots f_t(x_t|x_{1:t-1}), \tag{6.23}
\]

noting $x_{1:1} = x_1$. Similarly, let us drop the Markov property of the envelope,
permitting

\[
g_t(x_{1:t}) = g_1(x_1)\, g_2(x_2|x_{1:1})\, g_3(x_3|x_{1:2}) \cdots g_t(x_t|x_{1:t-1}). \tag{6.24}
\]

The importance weights then take the form

\[
w_t = \frac{f_1(x_1)\, f_2(x_2|x_{1:1})\, f_3(x_3|x_{1:2}) \cdots f_t(x_t|x_{1:t-1})}{g_1(x_1)\, g_2(x_2|x_{1:1})\, g_3(x_3|x_{1:2}) \cdots g_t(x_t|x_{1:t-1})} \tag{6.25}
\]

and the recursive updating expression for the importance weights is

\[
w_t = w_{t-1}\, \frac{f_t(x_t|x_{1:t-1})}{g_t(x_t|x_{1:t-1})} \tag{6.26}
\]

for t > 1. Example 6.6 presents an application of the approach described here for
non-Markov sequences. First, however, we consider a potential problem with the 
sequential weights. 


6.3.2.3 Weight Degeneracy, Rejuvenation, and Effective Sample Size As 
the importance weights are updated at each time, it is increasingly likely that the 
majority of the total weight will be concentrated in only a few sample sequences. The 
reason for this is that each component of a sequence must be reasonably consistent 
with each corresponding conditional density $f_t(x_t|x_{1:t-1})$ as time progresses. Each
time an unusual (i.e., unlikely) new component is appended to a sample
sequence, the weight for that entire sequence is proportionally reduced. Eventually
few, if any, sequences have avoided such pitfalls. We say that the weights are
increasingly degenerate as they become concentrated on only a few sequences $x_{1:t}$.
Degenerating weights degrade estimation performance. 

To identify such problems, the effective sample size can be used to measure 
the efficiency of using the envelope g with target f [386, 418]. This assesses the 
degeneracy of the weights. 

Degeneracy of the weights is related to their variability: as weights are in- 
creasingly concentrated on a few samples with the remaining weights near zero, the 
variance of the weights will increase. A useful measure of variability in the weights 
is the (squared) coefficient of variation (CV) given by 


\[
\mathrm{cv}^2\{w(X)\} = \frac{E\{[w(X) - E\{w(X)\}]^2\}}{(E\{w(X)\})^2}
= \frac{E\{[w(X) - n^{-1}]^2\}}{n^{-2}}
= E\{[n w(X) - 1]^2\}. \tag{6.27}
\]

Next we will identify how this quantity can be incorporated into a measure of weight
degeneracy. 

The effective sample size can be interpreted as indicating that the n weighted 
samples used in an importance sampling estimate are worth a certain number of 
unweighted i.i.d. samples drawn exactly from f. Suppose that n samples have normalized
importance weights $w(x^{(i)})$ for $i = 1, \ldots, n$. If z of these weights are zero
and the remaining n — z are equal to 1/(n — z), then estimation effectively relies on 
the n — z points with nonzero weights. 

In this case, the coefficient of variation in (6.27) can be estimated as

\[
\begin{aligned}
\widehat{\mathrm{cv}}^2\{w(X)\} &= \frac{1}{n}\sum_{i=1}^{n} \left[n w(x^{(i)}) - 1\right]^2 \\
&= \frac{1}{n}\left[\sum_{S_0} d(x^{(i)})^2 + \sum_{S_1} d(x^{(i)})^2\right] \\
&= \frac{1}{n}\left[z + (n - z)\left(\frac{n}{n-z} - 1\right)^2\right] \\
&= \frac{n}{n-z} - 1,
\end{aligned} \tag{6.28}
\]

where $d(x^{(i)}) = n w(x^{(i)}) - 1$ and the samples are partitioned based on whether the
weight is 0 or $1/(n-z)$ using $S_0 = \{i : w(x^{(i)}) = 0\}$ and $S_1 = \{i : w(x^{(i)}) = 1/(n-z)\}$.
Therefore $n - z = n/(1 + \widehat{\mathrm{cv}}^2\{w(X)\})$. Moreover, by the nature of the weights we
are considering, it is intuitive that the effective sample size should be n − z since that
is the number of nonzero weights. Thus we can measure the effective sample size as

\[
\hat{N}(g, f) = \frac{n}{1 + \widehat{\mathrm{cv}}^2\{w(X)\}}. \tag{6.29}
\]

Larger effective sample sizes are better, while smaller ones indicate increasing degeneracy.
The notation $\hat{N}(g, f)$ is used to stress that the effective sample size is a
measure of the quality of g as an envelope for f. In the case where the weights have
not been standardized, the equivalent expression is

n 


Applying (6.29) to calculate $\hat{N}(g, f)$ from the standardized weights is straightforward
since

\[
\begin{aligned}
\widehat{\mathrm{cv}}^2\{w(X)\} &= \frac{1}{n}\sum_{i=1}^{n} \left[n w(x^{(i)}) - 1\right]^2 \\
&= n \sum_{i=1}^{n} w(x^{(i)})^2 - 1,
\end{aligned} \tag{6.31}
\]

so

\[
\hat{N}(g, f) = \frac{1}{\sum_{i=1}^{n} w(x^{(i)})^2}. \tag{6.32}
\]

FIGURE 6.8 Illustration of sequential importance sampling. Time progresses from left to
right. Samples are indicated by solid lines. Densities are shaded boxes. Points are shown with
circles; the area of the circle is an indication of the sequence’s weight at that time. The dashed
line indicates an instance of weight regeneration through sequence resampling. See the text for
a detailed discussion.



In the case of sequential importance sampling, we may monitor $\hat{N}_t(g, f)$ to
assess importance weight degeneracy. Typically, the effective sample size will decrease
as time progresses. At time t when the effective sample size falls below some
threshold, the collection of sequences $x_{1:t}^{(i)}$, $i = 1, \ldots, n$, should be rejuvenated.
The simplest rejuvenation approach is to resample the sequences with replacement
with probabilities equal to the $w(x_{1:t}^{(i)})$ and then reset all weights to 1/n. Inclusion
of this multinomial resampling step is sometimes called sequential importance sampling
with resampling [421] and is closely related to the notion of particle filters, which
follows in Section 6.3.2.5. We illustrate the approach in several examples below. A variety
of more sophisticated approaches for reducing degeneration and implementing
rejuvenation are cited in Sections 6.3.2.4 and 6.3.2.5.
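
The effective sample size and the simplest rejuvenation step are easy to code. The Python sketch below is a minimal illustration with hypothetical function names; it computes the effective sample size as the reciprocal of the sum of squared standardized weights, which follows from combining (6.29) and (6.31), and it resamples sequences multinomially when that quantity falls below a fraction alpha of n.

```python
import random

def effective_sample_size(weights):
    """1 / (sum of squared standardized weights)."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)

def rejuvenate(sequences, weights, rng):
    """Multinomial resampling of whole sequences; weights reset to 1/n."""
    n = len(sequences)
    chosen = rng.choices(sequences, weights=weights, k=n)
    # Copy each chosen sequence so duplicates can evolve independently later.
    return [list(s) for s in chosen], [1.0 / n] * n

def maybe_rejuvenate(sequences, weights, alpha, rng):
    """Rejuvenate only when the effective sample size drops below alpha * n."""
    if effective_sample_size(weights) < alpha * len(weights):
        return rejuvenate(sequences, weights, rng)
    return sequences, weights

# The special case used to motivate (6.28): z = 6 zero weights out of n = 10.
weights = [0.0] * 6 + [0.25] * 4
print(effective_sample_size(weights))   # 4.0, that is, n - z

rng = random.Random(2)
sequences = [[i] for i in range(10)]
new_seqs, new_w = maybe_rejuvenate(sequences, weights, alpha=0.5, rng=rng)
print(sorted(s[0] for s in new_seqs))   # only sequences 6..9 can survive
```

In the toy case the effective sample size is 4, which is below 0.5 × 10, so resampling is triggered and every surviving sequence descends from one of the four with nonzero weight.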

Figure 6.8 illustrates the concepts discussed in the last two subsections. Time 
increases from left to right. The five shaded boxes and corresponding axes represent 
some hypothetical univariate conditional densities (like histograms turned sideways) 
for sampling the $X_t | x_{1:t-1}$. For each t, let $g_t$ be a uniform density over the range for
$X_t$ covered by the shaded boxes. Starting at t = 1 there are points $x_1^{(i)}$ for $i \in \{1, 2, 3\}$,
represented by the three small circles on the leftmost edge of Figure 6.8. Their current
weights are represented by the sizes of the circles; initially the weights are equal
because $g_1$ is uniform. To initiate a cycle of the algorithm, the three points receive
weights proportional to $f_1(x_1)/g_1(x_1)$. The weighted points are shown just touching
the $x_1$ axis. The point in the dominant region of $f_1$ receives proportionally more
weight than the other two, so its circle has grown larger. The effect is dampened in the
figure because the weights are rescaled for visual presentation to prevent the circles
from growing too large to fit in the graph or too small to see.

At time t = 2, new uniform draws are appended to the current points, thereby
forming sequences of length 2. The lines between the first two density graphs indicate
which $x_2$ is appended to which $x_1$. At this point, the sequence reaching the middle $x_2$
point has the largest weight because its $x_1$ point had high weight at the first stage and
its $x_2$ sits in a region of high density according to $f_2(x_2|x_1)$. The bottom path also
had increased weight because its $x_2$ landed in the same high-density region for t = 2.
Again, $g_2$ has not influenced the weights because it is uniform.

At the next time step, the sequences have grown by appending sampled $x_3$
values, and they have been reweighted according to $f_3(x_3|x_{1:2})$. At this point, suppose
that the weights are sufficiently degenerate to produce an effective sample size
$\hat{N}(g_3, f_3)$ small enough to trigger regeneration of the weights. This event is indicated
by the dashed vertical line. To fix the problem, we resample three sequences with
replacement from the current ones with probabilities proportional to their respective
weights. In the figure, two sequences will now progress from the middle $x_3$, one from
the bottom $x_3$, and none from the top $x_3$. The latter sequence becomes extinct, to be
replaced by two sequences having the same past up to t = 3 but different futures beyond
that. Finally, the right portion of the figure progresses as before, with additional
samples at each t and corresponding adjustments to the weights.


Example 6.6 (High-Dimensional Distribution) Although the role of t has been 
emphasized here as indexing a sequence of random variables X; for t = 1,2,..., 
sequential importance sampling can also be used as a strategy for the generic prob- 
lem of sampling from a p-dimensional target distribution f, for fixed p. We can 
accumulate an importance weighted sample from this high-dimensional distribution 
one dimension at a time by sampling from the univariate conditional densities as suggested
by (6.23) and (6.24). After p steps we obtain the $x_{1:p}^{(i)}$ and their corresponding
weights needed to approximate a sample from $f_p$.

For example, consider a sequence of probability densities given by
$f_t(x_{1:t}) = k_t \exp\{-\|x_{1:t}\|^3/3\}$ for some constants $k_t$, where $\|\cdot\|$ is the Euclidean norm and
$t = 1, \ldots, p$. Here $X_{1:p}$ is the random variable having density $f_p$. Although $f_t(x_{1:t})$
is a smooth unimodal density with tails somewhat lighter than the normal distribution,
there is no easy method for sampling from it directly. Note also that the sequence of
random variables $X_{1:t}$ is not Markov because for each t the density of $X_t | X_{1:t-1}$
depends on all prior components of $X_{1:t}$. Thus this example requires the general
sequential importance sampling approach described in Section 6.3.2.2.

Let us adopt a standard normal distribution for the envelope. The tth conditional
density can be expressed as $f_t(x_t|x_{1:t-1}) = f_t(x_{1:t})/f_{t-1}(x_{1:t-1})$. The sequential importance
sampling strategy is given by:


1. Let t = 1. Sample n points $x_1^{(1)}, \ldots, x_1^{(n)}$ i.i.d. from a standard normal
distribution. Calculate the initial weights as $w_1^{(i)} = \exp\{-|x_1^{(i)}|^3/3\}/\phi(x_1^{(i)})$,
where $\phi$ is the standard normal density function. The constants $k_t$ vanish when
the weights are standardized and are therefore ignored.

2. For t > 1 sample n points $x_t^{(i)}$ from a standard normal distribution. Append
these to the $x_{1:t-1}^{(i)}$, yielding the $x_{1:t}^{(i)}$.

3. Calculate the weight adjustment factors $u_t^{(i)} = f_t(x_t^{(i)}|x_{1:t-1}^{(i)})/\phi(x_t^{(i)})$.

4. Set $w_t^{(i)} = w_{t-1}^{(i)} u_t^{(i)}$ and standardize the new weights.

5. Calculate the effective sample size $\hat{N}_t(\phi, f_t)$. Let $\alpha$ control the tolerance for
degeneracy. If $\hat{N}_t(\phi, f_t) < \alpha n$ then resample the current sample of $x_{1:t}^{(i)}$ with
replacement with probabilities equal to the $w_t^{(i)}$ and discard the $x_{1:t}^{(i)}$ in favor of
this new sample. In this case, reset all the weights to 1/n.

6. Increment t and return to step 2 until t = p.


Note that the resampling strategy in step 5 attempts to rejuvenate the sample peri- 
odically to reduce degeneracy; hence the algorithm is called sequential importance 
sampling with resampling. 

As an example, set p = 50, n = 5000, and $\alpha = 0.2$. Over the 50 stages, step 5
was triggered 8 times. The median effective sample size during the process was 
about 2000. Suppose we simply drew 5000 points from a 50-dimensional standard 
normal envelope and importance reweighted these points all at once. This one-step 
approach yielded an effective sample size of about 1.13. Increasing the sample size 
to n = 250,000 only increases the effective sample size to 1.95 for the one-step 
approach. High-dimensional space is too vast to stumble across many p-dimensional 
draws with high density using the one-step approach: good points must be generated 
sequentially one dimension at a time. 
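
The six steps of Example 6.6 translate almost directly into code. The Python sketch below is an illustration under stated assumptions rather than the implementation behind the reported results: the function names and seed are ours, weights are carried on the log scale for numerical stability, and the demonstration uses p = 10 and n = 500 so that it runs quickly. A one-step comparison that weights p-dimensional normal draws all at once is included.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_target(sq_norm):
    # log of exp{-||x||^3 / 3} given ||x||^2; the constant k_t is dropped.
    return -(sq_norm ** 1.5) / 3.0

def log_std_normal(x):
    return -0.5 * x * x - LOG_SQRT_2PI

def standardize(log_w):
    # Standardized weights computed stably from log weights.
    shift = max(log_w)
    raw = [math.exp(lw - shift) for lw in log_w]
    total = sum(raw)
    return [r / total for r in raw]

def ess(weights):
    return 1.0 / sum(w * w for w in weights)

def sisr(p, n, alpha, rng):
    seqs = [[] for _ in range(n)]
    sq_norm = [0.0] * n          # running ||x_{1:t}||^2 for each sequence
    log_w = [0.0] * n
    w = [1.0 / n] * n
    n_resample = 0
    for _ in range(p):
        for i in range(n):
            x = rng.gauss(0.0, 1.0)                       # envelope draw
            new_sq = sq_norm[i] + x * x
            # log u_t = log f_t(x_{1:t}) - log f_{t-1}(x_{1:t-1}) - log phi(x_t)
            log_w[i] += (log_target(new_sq) - log_target(sq_norm[i])
                         - log_std_normal(x))
            seqs[i].append(x)
            sq_norm[i] = new_sq
        w = standardize(log_w)
        if ess(w) < alpha * n:                            # rejuvenation step
            idx = rng.choices(range(n), weights=w, k=n)
            seqs = [list(seqs[j]) for j in idx]
            sq_norm = [sq_norm[j] for j in idx]
            log_w = [0.0] * n
            w = [1.0 / n] * n
            n_resample += 1
    return seqs, w, n_resample

def one_step_ess(p, n, rng):
    log_w = []
    for _ in range(n):
        xs = [rng.gauss(0.0, 1.0) for _ in range(p)]
        log_w.append(log_target(sum(x * x for x in xs))
                     - sum(log_std_normal(x) for x in xs))
    return ess(standardize(log_w))

rng = random.Random(42)
P, N, ALPHA = 10, 500, 0.2
seqs, w, n_resample = sisr(P, N, ALPHA, rng)
print("resampling steps:", n_resample, " final ESS:", ess(w))
print("one-step ESS    :", one_step_ess(P, N, rng))
```

Because resampling resets the weights whenever the effective sample size falls below the threshold, the effective sample size at the end of each stage is never smaller than $\alpha n$.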


6.3.2.4 Sequential Importance Sampling for Hidden Markov Models
Another class of distributions to which sequential importance sampling can be applied
effectively is generated by hidden Markov models. Consider a Markov sequence of
unobservable variables $X_0, X_1, X_2, \ldots$ indexed by time t. Suppose that these variables
represent the state of some Markov process, so the distribution of $X_t | x_{1:t-1}$
depends only on $x_{t-1}$. Although these states are unobservable, suppose that there is
also an observable sequence of random variables $Y_0, Y_1, Y_2, \ldots$ where $Y_t$ is dependent
on the process state at the same time, namely $X_t$. Thus we have the model

\[
Y_t \sim p_y(y_t|x_t) \quad \text{and} \quad X_t \sim p_x(x_t|x_{t-1})
\]

for $t = 1, 2, \ldots$, where $p_x$ and $p_y$ are density functions. This is called a hidden Markov
process.

We wish to use the observed $y_{1:t}$ as data to estimate the states $x_{1:t}$ of the hidden
Markov process. In the importance sampling framework $f_t(x_{1:t}|y_{1:t})$ is the target
distribution.

Note that there is a recursive relationship between $f_t$ and $f_{t-1}$. Specifically,

\[
f_t(x_{1:t}|y_{1:t}) = f_{t-1}(x_{1:t-1}|y_{1:t-1})\, p_x(x_t|x_{t-1})\, p_y(y_t|x_t). \tag{6.33}
\]




Suppose that at time t we adopt the envelope $g_t(x_t|x_{1:t-1}) = p_x(x_t|x_{t-1})$. Then the
multiplicative update for the importance weights can be expressed as

\[
u_t = \frac{f_t(x_{1:t}|y_{1:t})}{f_{t-1}(x_{1:t-1}|y_{1:t-1})\, p_x(x_t|x_{t-1})} = p_y(y_t|x_t). \tag{6.34}
\]


The final equality results from the substitution of (6.33) into (6.34). 
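
Written out, the substitution cancels every factor except the observation density. A brief derivation in the notation of (6.33) and (6.34) is

\[
u_t = \frac{f_{t-1}(x_{1:t-1}|y_{1:t-1})\, p_x(x_t|x_{t-1})\, p_y(y_t|x_t)}{f_{t-1}(x_{1:t-1}|y_{1:t-1})\, p_x(x_t|x_{t-1})} = p_y(y_t|x_t).
\]

Consequently, if no resampling intervenes, the unstandardized weight of a sequence after t updates is $w_t = w_0 \prod_{i=1}^{t} p_y(y_i|x_i)$, where $w_0$ (our notation, not used elsewhere in this section) denotes the weight assigned at initialization.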

This framework can be recast in Bayesian terms. In this case, $X_{1:t}$ are considered
parameters. The prior distribution at time t is $p_x(x_0) \prod_{i=1}^{t} p_x(x_i|x_{i-1})$. The likelihood
is obtained from the observed data density, equaling $\prod_{i=1}^{t} p_y(y_i|x_i)$. The posterior
$f_t(x_{1:t}|y_{1:t})$ is proportional to the product of the prior and the likelihood, as obtained
recursively from (6.33). Thus the importance weight update at time t is the likelihood
obtained from the new data $y_t$ at time t. The sequential factorization given here is akin
to iterating Example 6.4 in which we sampled from the prior distribution and weighted 
by the likelihood. A similar strategy is described by [113], where the procedure is 
generalized to sample dimensions in batches. 


Example 6.7 (Terrain Navigation) An airplane flying over uneven terrain can 
use information about the ground elevation beneath it to estimate its current location. 
As the plane follows its flight path, sequential elevation measurements are taken. 
Simultaneously, an inertial navigation system provides an estimated travel direction 
and distance. At each time point the previously estimated location of the plane is 
updated using both types of new information. Interest in such problems arises in, for 
example, military applications where the approach could serve as an alternative or 
backup to global satellite systems. Details on the terrain navigation problem are given 
by [30, 31, 287]. 

Let the two-dimensional variable $X_t = (X_{1t}, X_{2t})$ denote the true location of
the plane at time t, and let $d_t$ denote the measured drift, or shift in the plane location
during the time increment as measured by the inertial navigation system. The key
data for terrain navigation come from a map database that contains (or interpolates)
the true elevation $m(x_t)$ at any location $x_t$.

Our hidden Markov model for terrain navigation is 


\[
Y_t = m(X_t) + \delta_t \quad \text{and} \quad X_t = X_{t-1} + d_t + \epsilon_t, \tag{6.35}
\]

where $\epsilon_t$ and $\delta_t$ are independent random error processes representing the error in drift
and elevation measurement, respectively, and $Y_t$ is the observed elevation. We treat
$d_t$ as a known term in the location process rather than a measurement and allow any
measurement error to be subsumed into $\epsilon_t$.

Figure 6.9 shows a topographic map of a region in Colorado. Light shading 
corresponds to higher ground elevation and the units are meters. Let us suppose that the 
plane is following a circular arc specified by 101 angles $\theta_t$ (for $t = 0, \ldots, 100$) equally
spaced between $\pi/2$ and 0, with the true location at time t being $x_t = (\cos\theta_t, \sin\theta_t)$
and the true drift $d_t$ being the difference between the locations at times t and t − 1.
Let us assume that measurement error in the elevation process can be modeled as
$\delta_t \sim N(0, \sigma^2)$ where we assume $\sigma = 75$ here.

FIGURE 6.9 Results of Example 6.7 showing an image of ground elevations for a region of
Colorado, where lighter shades correspond to higher elevations. The dashed line is the true, 
unknown, flight path and the solid line is the estimated path. The two are nearly the same. 


Suppose that random error in location $\epsilon_t$ has a distribution characterized by

\[
\epsilon_t = R_t^{T} Z_t, \quad \text{where} \quad
R_t = \begin{pmatrix} -x_{1t} & -x_{2t} \\ -x_{2t} & x_{1t} \end{pmatrix}
\quad \text{and} \quad
Z_t \sim N_2\!\left(0,\; q^2 \begin{pmatrix} 1 & 0 \\ 0 & k^2 \end{pmatrix}\right),
\]

where we take q = 400 and k = 1/2. This distribution $g_t(\epsilon_t)$ effectively constitutes the importance
sampling envelope $g_t(x_t|x_{t-1})$. This complicated specification is more simply
described by saying that $\epsilon_t$ has a bivariate normal distribution with standard deviations
q and kq, rotated so that the major axis of density contours is parallel to the tangent of
the flight path at the current location. A standard bivariate normal distribution would 
be an alternative choice, but ours simulates the situation where uncertainty about the 
distance flown during the time increment is greater than uncertainty about deviations 
orthogonal to the direction of flight. 

In this example, we maintain $n = 100$ sampled trajectories $x_{0:t}^{(1)}, \ldots, x_{0:t}^{(100)}$,
although this number would be much greater in a real application. To initiate the
model we sample from a bivariate normal distribution centered at $x_0$ with standard
deviations of 50. In real life, one could imagine that the initialization point corresponds 
to the departure airport or to some position update provided by occasional detection 
stations along the flight path, which provide highly accurate location data allowing 
the current location to be “reinitialized.” 

The sequential importance sampling algorithm for this problem proceeds as 
follows: 


1. Initialize at $t = 0$. Draw $n$ starting points $x_0^{(i)}$ for $i = 1, \ldots, n$.
2. Receive observed elevation data $Y_t$.
3. Calculate the weight update factor $u_t^{(i)} = \phi(y_t; m(x_t^{(i)}), \sigma^2)$ where $\phi(\cdot\,; a, b^2)$ is
the value of the normal density with mean $a$ and standard deviation $b$ evaluated
at its argument.

4. Update the weights according to $w_t^{(i)} = w_{t-1}^{(i)} u_t^{(i)}$. If $t = 0$, then $w_{t-1}^{(i)} = 1/n$.
Normalize the weights so they sum to 1.


5. The current estimate of the true location is $\hat{x}_t = \sum_{i=1}^{n} w_t^{(i)} x_t^{(i)}$.


6. Check for weight degeneracy by calculating the effective sample size, namely
$\hat{N}(g_t, f_t) = 1 / \sum_{i=1}^{n} (w_t^{(i)})^2$. If $\hat{N}(g_t, f_t) < \alpha n$, then rejuvenate the sample
according to the substeps below. Here $\alpha$ is a threshold that governs the lowest
tolerable ratio of effective sample size to actual sample size. We used $\alpha = 0.3$.
If rejuvenation is not needed, proceed straight to step 7. 


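a. Resample $x_{t,\mathrm{new}}^{(1)}, \ldots, x_{t,\mathrm{new}}^{(n)}$ from $x_t^{(1)}, \ldots, x_t^{(n)}$ with replacement with
probabilities $w_t^{(1)}, \ldots, w_t^{(n)}$.

b. Replace the current sample of $x_t^{(i)}$ with the new draws $x_{t,\mathrm{new}}^{(i)}$. In other words,
set $x_t^{(i)} = x_{t,\mathrm{new}}^{(i)}$.

c. Reset the weights to $w_t^{(i)} = 1/n$ for all $i$.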


7. Sample a set of location errors $\epsilon_{t+1}^{(i)} \sim g_{t+1}(\epsilon)$.

8. Advance the set of locations according to $x_{t+1}^{(i)} = x_t^{(i)} + d_{t+1} + \epsilon_{t+1}^{(i)}$.

9. Increment $t$ and return to step 2.


Note that this algorithm incorporates the resampling option to reduce weight 
degeneracy. 
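The steps above can be sketched in Python as follows. This is only a minimal illustration
under stated assumptions, not the analysis behind Figure 6.9: the function `elevation`, the
10 km arc radius, and the isotropic location error with standard deviation `q` are
hypothetical stand-ins for the Colorado data and the rotated envelope. The normalizing
constant of the normal density is dropped because it cancels when the weights are normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def elevation(xy):
    """Hypothetical smooth terrain m(x) in meters (NOT the Colorado data)."""
    x, y = xy[..., 0], xy[..., 1]
    return 2000 + 600 * np.sin(x / 700.0) + 500 * np.cos(y / 500.0) + 0.1 * x

# Assumed settings: T+1 time points, n particles, sigma and alpha as in the text.
T, n, sigma, alpha, q = 100, 100, 75.0, 0.3, 100.0
theta = np.linspace(np.pi / 2, 0.0, T + 1)
truth = 10_000 * np.column_stack([np.cos(theta), np.sin(theta)])
drift = np.diff(truth, axis=0)                        # drift[t] is d_{t+1}
y_obs = elevation(truth) + rng.normal(0, sigma, T + 1)

x = truth[0] + rng.normal(0, 50.0, size=(n, 2))       # step 1
logw = np.full(n, -np.log(n))                         # log of w_{t-1} = 1/n
est = np.empty((T + 1, 2))
ess_trace = np.empty(T + 1)

for t in range(T + 1):
    # Steps 2-4: multiply weights by the normal density (on the log scale).
    logw += -0.5 * ((y_obs[t] - elevation(x)) / sigma) ** 2
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()
    est[t] = w @ x                                    # step 5
    ess_trace[t] = 1.0 / np.sum(w ** 2)               # step 6
    if ess_trace[t] < alpha * n:
        idx = rng.choice(n, size=n, replace=True, p=w)    # substeps a and b
        x = x[idx]
        logw = np.full(n, -np.log(n))                 # substep c
    if t < T:
        eps = rng.normal(0, q, size=(n, 2))           # step 7 (isotropic, simplified)
        x = x + drift[t] + eps                        # step 8

rmse = np.sqrt(np.mean(np.sum((est - truth) ** 2, axis=1)))
```

Keeping the weights on the log scale is a design choice of this sketch that guards against
numerical underflow; it does not change the algorithm.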

In this algorithm, each sequence $x_{0:t}^{(i)}$ represents one possible path for the airplane,
and these complete paths have corresponding importance weights $w_t^{(i)}$. Figure 6.9
shows the result. The dashed line is the true flight path and the solid line is
the estimated path calculated as in step 5 above, using the elevation data $y_t$.

The results show very good tracking of the true locations. Of course, the result 
is dependent on n and the magnitudes of noise in the state and observation processes. 
Performance here also benefits from the fact that Colorado terrain is very distinctive 
with large variation in elevations. In flat areas the procedure is less effective. Although 
the Colorado Rocky Mountains are rarely flat, there are long ridges and valleys that 
have relatively constant elevation, thereby tempting the algorithm to pursue false 
directions. The algorithm will also struggle when the terrain exhibits localized topo- 
logical features (e.g., hilltops) that resemble each other and are repeated across a 
region of the map. In that case, some $x_t^{(i)}$ may jump to the wrong hilltop.

Although estimation performance is good in our example, maintenance of a 
large effective sample size was poor. A majority of the iterations included a rejuve- 
nation step. The data for this example are available from the book website. 


The sequential importance sampling technique evolved from a variety of meth- 
ods for sequential imputation and Monte Carlo analysis of dynamic systems [34, 386, 
416, 418, 430]. Considerable research has focused on methods to improve or adapt
$g_t$ and to slow weight degeneracy [90, 95, 168, 239, 260, 386, 417, 419, 679].


6.3.2.5 Particle Filters As we have discussed above, sequential importance sam- 
pling is often of little help unless resampling steps are inserted in the algorithm at 
least occasionally to slow weight degeneracy that is signaled by diminishing effective 
sample sizes. Particle filters are sequential Monte Carlo strategies that emphasize 
the need for preventing degeneracy [90, 167, 239, 274, 379]. They have been de- 
veloped primarily within the framework of hidden Markov models as described in 
Section 6.3.2.4. At time t, the current sample of sequences is viewed as a collection of 
weighted particles. Particles with high weights enjoy increasing influence in Monte 
Carlo estimates, while underweighted particles are filtered away as $t$ increases.

Particle filters can be seen as generalizations of sequential importance sampling, 
or as specific strategies for sequential importance sampling with resampling. The 
distinct names of these approaches hide their methodological similarity. As noted 
previously, sequential importance sampling can be supplemented by a resampling 
step where the $x_{1:t}^{(i)}$ are resampled with replacement with probability proportional to
their current weights and then the weights are reset to 1/n. In the simplest case, this 
would be triggered when the effective sample size diminished too much. Adopting a 
particle filter mindset, one would instead resample at each t. An algorithm described as 
a particle filter is often characterized by a stronger focus on resampling or adjustment 
of the new draws between or within cycles to prevent degeneracy. 

Resampling alone does not prevent degeneracy. Although low-weight se- 
quences are likely to be discarded, high-weight sequences are merely replicated rather 
than diversified. Particle filters favor tactics like perturbing or smoothing samples at 
each t. For example, with a particle filter one might supplement the resampling step by 
moving samples according to a Markov chain transition kernel with appropriate sta- 
tionary distribution [239]. Alternatively, one could smooth the resampling step via a 
weighted smoothed bootstrap [175], for example, by replacing the simple multinomial 
resampling with sampling from an importance-weighted mixture of smooth densities 
centered at some or all the current particles [252, 273, 611, 658]. Another strategy 
would be to employ an adaptive form of bridge sampling (see Section 6.3.1.1) to facil- 
itate the sequential sampling steps [260]. The references in Section 6.3.2.4 regarding 
improvements to sequential importance sampling are also applicable here. 

The simplest particle filter is the bootstrap filter [274]. This approach relies 
on a simple multinomial importance-weighted resample at each stage, rather than 
waiting for the weights to degenerate excessively. In other words, sequences are 
resampled with replacement with probability proportional to their current weights, 
then the weights are reset to $1/n$. The sequential importance sampling strategy we have
previously described would wait until resampling was triggered by a low effective 
sample size before conducting such a resample. 


Example 6.8 (Terrain Navigation, Continued) The bootstrap filter is easy to 
implement in the terrain navigation example. Specifically, we always resample the 
current collection of paths at each t. 
Thus, we replace step 6 in Example 6.7 with
6. Regardless of the value of $\hat{N}(g_t, f_t)$, carry out the following substeps:


where those substeps are (a), (b), and (c) listed under step 6 in that example. 
The estimated flight path is qualitatively indistinguishable from the estimate in 
Figure 6.9. 
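The difference between triggered resampling and the bootstrap filter can be made concrete
with a few helper functions. The sketch below is illustrative only; the function names and
the convention that `alpha=None` means "always resample" are assumptions of this example,
not part of the text.

```python
import numpy as np

def effective_sample_size(w):
    """Return 1 / sum(w_i^2) after normalizing the weights to sum to 1."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def resample(particles, w, rng):
    """Multinomial resampling with replacement; weights are reset to 1/n."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    idx = rng.choice(n, size=n, replace=True, p=w / w.sum())
    return particles[idx], np.full(n, 1.0 / n)

def rejuvenate(particles, w, rng, alpha=None):
    """alpha=None: bootstrap filter (resample at every t).
    Otherwise: resample only when the effective sample size falls below alpha * n."""
    if alpha is None or effective_sample_size(w) < alpha * len(w):
        return resample(particles, w, rng)
    return particles, w

rng = np.random.default_rng(1)
particles = np.arange(5.0)
weights = np.array([0.96, 0.01, 0.01, 0.01, 0.01])
boot_particles, boot_weights = rejuvenate(particles, weights, rng)   # always resamples
```

Calling `rejuvenate(particles, weights, rng, alpha=0.3)` instead reproduces the triggered
behavior of step 6 in Example 6.7.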


6.4 VARIANCE REDUCTION TECHNIQUES 


The simple Monte Carlo estimator of $\int h(x) f(x)\,dx$ is

$$\hat{\mu}_{MC} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$$

where the variables $X_1, \ldots, X_n$ are randomly sampled from $f$. This approach is
intuitively appealing, and we have thus far focused on methods to generate draws 
from f. In some situations, however, better Monte Carlo estimators can be derived. 
These approaches are still based on the principle of averaging Monte Carlo draws, 
but they employ clever sampling strategies and different forms of estimators to yield 
integral estimates with lower variance than the simplest Monte Carlo approach. 


6.4.1 Importance Sampling 


Suppose we wish to estimate the probability that a die roll will yield a one. If we roll 
the die n times, we would expect to see about n/6 ones, and our point estimate 
of the true probability would be the proportion of ones in the sample. The variance 
of this estimator is $5/(36n)$ if the die is fair. To achieve an estimate with a coefficient
of variation of, say, 5%, one should expect to have to roll the die 2000 times.

To reduce the number of rolls required, consider biasing the die by replacing the
faces bearing 2 and 3 with additional 1 faces. This increases the probability of rolling
a one to 0.5, but we are no longer sampling from the target distribution provided by
a fair die. To correct for this, we should weight each roll of a one by 1/3. In other
words, let $Y_i = 1/3$ if the roll is a one and $Y_i = 0$ otherwise. Then the expectation of
the sample mean of the $Y_i$ is 1/6, and the variance of the sample mean is $1/(36n)$. To
achieve a coefficient of variation of 5% for this estimator, one expects to need only 
400 rolls. 
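The arithmetic in the die example is easy to check numerically. The following sketch
simulates both dice and computes the number of rolls needed for a 5% coefficient of
variation; the helper `rolls_needed` and the choice of 2000 simulated rolls are assumptions
made for illustration.

```python
import math
from fractions import Fraction

import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Fair die: estimate P(one) by the sample proportion of ones.
fair = rng.integers(1, 7, size=n)
est_fair = np.mean(fair == 1)

# Biased die: faces 2 and 3 are relabeled as 1, so a "one" appears with
# probability 1/2. Each observed one gets weight (1/6) / (1/2) = 1/3.
biased = rng.integers(1, 7, size=n)
is_one = biased <= 3
est_is = np.mean(np.where(is_one, 1.0 / 3.0, 0.0))

def rolls_needed(var_single, mean, cv=Fraction(1, 20)):
    """Smallest n such that sqrt(var_single / n) / mean <= cv (exact arithmetic)."""
    return math.ceil(var_single / (cv * mean) ** 2)

n_fair = rolls_needed(Fraction(1, 6) * Fraction(5, 6), Fraction(1, 6))
n_is = rolls_needed(Fraction(1, 36), Fraction(1, 6))
```

Exact fractions are used in `rolls_needed` so that rounding error cannot push the answer
past the integer boundary.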

This improved accuracy is achieved by causing the event of interest to occur 
more frequently than it would in the naive Monte Carlo sampling framework, thereby 
enabling more precise estimation of it. Using importance sampling terminology, the 
die-rolling example is successful because an importance sampling distribution (cor- 
responding to rolling the die with three ones) is used to oversample a portion of the 
state space that receives lower probability under the target distribution (for the out- 
come of a fair die). An importance weighting corrects for this bias and can provide an 
improved estimator. For very rare events, extremely large reductions in Monte Carlo 
variance are possible. 
The importance sampling approach is based upon the principle that the expectation
of $h(X)$ with respect to its density $f$ can be written in the alternative form

$$\mu = \int h(x) f(x)\,dx = \int h(x)\, \frac{f(x)}{g(x)}\, g(x)\,dx \tag{6.36}$$

or even

$$\mu = \frac{\int h(x) f(x)\,dx}{\int f(x)\,dx} = \frac{\int h(x)\,[f(x)/g(x)]\,g(x)\,dx}{\int [f(x)/g(x)]\,g(x)\,dx}, \tag{6.37}$$

where $g$ is another density function, called the importance sampling function or
envelope.

Equation (6.36) suggests that a Monte Carlo approach to estimating $E\{h(X)\}$
is to draw $X_1, \ldots, X_n$ i.i.d. from $g$ and use the estimator

$$\hat{\mu}^*_{IS} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)\, w^*(X_i), \tag{6.38}$$


where $w^*(X_i) = f(X_i)/g(X_i)$ are unstandardized weights, also called importance
ratios. For this strategy to be convenient, it must be easy to sample from $g$ and to
evaluate $f$, even when it is not easy to sample from $f$.

Equation (6.37) suggests drawing $X_1, \ldots, X_n$ i.i.d. from $g$ and using the
estimator

$$\hat{\mu}_{IS} = \sum_{i=1}^{n} h(X_i)\, w(X_i), \tag{6.39}$$

where $w(X_i) = w^*(X_i) / \sum_{i=1}^{n} w^*(X_i)$ are standardized weights. This second approach
is particularly important in that it can be used when $f$ is known only up
to a proportionality constant, as is frequently the case when $f$ is a posterior density
in a Bayesian analysis.
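As a small illustration of (6.38) and (6.39), the sketch below estimates $E\{X^2\} = 1$ for a
standard normal target using a Student-$t$ envelope with 3 degrees of freedom, whose tails
are heavier than those of the target. The target, envelope, and sample size are choices
made for this example only. The unnormalized function `q` stands in for a density known
only up to a constant.

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def h(x):
    return x ** 2

def q(x):
    """Target known up to a constant: f(x) = q(x) / sqrt(2 * pi)."""
    return np.exp(-0.5 * x ** 2)

def g_pdf(x, df=3):
    """Density of the Student-t envelope with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x ** 2 / df) ** (-(df + 1) / 2)

x = rng.standard_t(3, size=n)          # draws from the envelope g

# Unstandardized weights require the normalized density f, as in (6.38).
w_star = q(x) / math.sqrt(2 * math.pi) / g_pdf(x)
mu_unstd = np.mean(h(x) * w_star)
se_unstd = np.std(h(x) * w_star, ddof=1) / math.sqrt(n)

# Standardized weights need only q, as in (6.39).
w_raw = q(x) / g_pdf(x)
w = w_raw / w_raw.sum()
mu_std = np.sum(h(x) * w)
```

Both estimates should be close to 1, and the average of `w_star` should be close to 1 as
well, which offers a quick sanity check on the envelope.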

Both estimators converge by the same argument applied to the simple Monte 
Carlo estimator given in (6.1), as long as the support of the envelope includes all of 
the support of f. In order for the estimators to avoid excess variability, it is important 
that f(x)/g(x) be bounded and that g have heavier tails than f. If this requirement is 
not met, then some standardized importance weights will be huge. A rare draw from 
g with much higher density under f than under g will receive huge weight and inflate 
the variance of the estimator. 

Naturally, $g(X)$ often will be larger than $f(X)$ when $X \sim g$, yet it is easy to
show that $E\{f(X)/g(X)\} = 1$. Therefore, if $f(X)/g(X)$ is to have mean 1, this ratio
must sometimes be quite large to counterbalance the predominance of values between
0 and 1. Thus, the variance of $f(X)/g(X)$ will tend to be large. Hence, we should
expect the variance of $h(X) f(X)/g(X)$ to be large, too. For an importance sampling
estimate of $\mu$ to have low variance, therefore, we should choose the function $g$ so that
$f(x)/g(x)$ is large only when $h(x)$ is very small. For example, when $h$ is an indicator
function that equals 1 only for a very rare event, we can choose $g$ to sample in a
way that makes that event occur much more frequently, at the expense of failing to
sample adequately uninteresting outcomes for which $h(x) = 0$. This strategy works
very well in cases where estimation of a small probability is of interest, such as in
estimation of statistical power, probabilities of failure or exceedance, and likelihoods
over combinatorial spaces like those that arise frequently with genetic data.

Also recall that the effective sample size given in (6.29) can be used to mea- 
sure the efficiency of an importance sampling strategy using envelope g. It can be 
interpreted as indicating that the n weighted samples used in an importance sampling 
estimate are worth $\hat{N}(g, f)$ unweighted i.i.d. samples drawn exactly from $f$ and used
in a simple Monte Carlo estimate [386, 417]. In this sense, it is an excellent way to 
assess the quality of the envelope g. Section 6.3.2.2 provides further detail. 

The choice between using the unstandardized and the standardized weights 
depends on several considerations. First consider the estimator $\hat{\mu}^*_{IS}$ defined in (6.38)
using the unstandardized weights. Let $t(x) = h(x) w^*(x)$. When $X_1, \ldots, X_n$ are drawn
i.i.d. from $g$, let $\bar{w}^*$ and $\bar{t}$ denote averages of the $w^*(X_i)$ and $t(X_i)$, respectively. Note
$E\{\bar{w}^*\} = E\{w^*(X)\} = 1$. Now,

$$E\{\hat{\mu}^*_{IS}\} = \frac{1}{n} \sum_{i=1}^{n} E\{t(X_i)\} = \mu \tag{6.40}$$

and

$$\mathrm{var}\{\hat{\mu}^*_{IS}\} = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{var}\{t(X_i)\} = \frac{1}{n}\, \mathrm{var}\{t(X)\}. \tag{6.41}$$

Thus $\hat{\mu}^*_{IS}$ is unbiased, and an estimator of its Monte Carlo standard error is the sample
standard deviation of $t(X_1), \ldots, t(X_n)$ divided by $\sqrt{n}$.

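Now consider the estimator $\hat{\mu}_{IS}$ defined in (6.39) that employs importance
weight standardization. Note that $\hat{\mu}_{IS} = \bar{t}/\bar{w}^*$. Taylor series approximations yield

$$\begin{aligned}
E\{\hat{\mu}_{IS}\} &= E\left\{\bar{t}\,\left[1 - (\bar{w}^* - 1) + (\bar{w}^* - 1)^2 + \cdots\right]\right\} \\
&= E\left\{\bar{t} - (\bar{t} - \mu)(\bar{w}^* - 1) - \mu(\bar{w}^* - 1) + \bar{t}\,(\bar{w}^* - 1)^2 + \cdots\right\} \\
&= \mu - \frac{1}{n}\,\mathrm{cov}\{t(X), w^*(X)\} + \frac{\mu}{n}\,\mathrm{var}\{w^*(X)\} + O\!\left(\frac{1}{n^2}\right).
\end{aligned} \tag{6.42}$$

Thus, standardizing the importance weights introduces a slight bias in the estimator
$\hat{\mu}_{IS}$. The bias can be estimated by replacing the variance and covariance terms in (6.42)
with sample estimates obtained from the Monte Carlo draws; see also Example 6.12.
The variance of $\hat{\mu}_{IS}$ is similarly found to be

$$\mathrm{var}\{\hat{\mu}_{IS}\} = \frac{1}{n}\left[\mathrm{var}\{t(X)\} + \mu^2\, \mathrm{var}\{w^*(X)\} - 2\mu\, \mathrm{cov}\{t(X), w^*(X)\}\right] + O\!\left(\frac{1}{n^2}\right). \tag{6.43}$$

Again, a variance estimate for $\hat{\mu}_{IS}$ can be computed by replacing the variances and
covariances in (6.43) with sample estimates obtained from the Monte Carlo draws.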
Finally, consider the mean squared errors (MSE) of $\hat{\mu}^*_{IS}$ and $\hat{\mu}_{IS}$. Combining
the bias and variance estimates derived above, we find

$$\mathrm{MSE}\{\hat{\mu}_{IS}\} - \mathrm{MSE}\{\hat{\mu}^*_{IS}\} = \frac{1}{n}\left(\mu^2\, \mathrm{var}\{w^*(X)\} - 2\mu\, \mathrm{cov}\{t(X), w^*(X)\}\right) + O\!\left(\frac{1}{n^2}\right). \tag{6.44}$$


Assuming without loss of generality that $\mu > 0$, the leading terms in (6.44) suggest
that the approximate difference in mean squared errors is negative when

$$\mathrm{cor}\{t(X), w^*(X)\} > \frac{\mathrm{cv}\{w^*(X)\}}{2\,\mathrm{cv}\{t(X)\}}, \tag{6.45}$$

where $\mathrm{cv}\{\cdot\}$ denotes a coefficient of variation. This condition can be checked
using sample-based estimators as discussed above. Thus, using the standardized 
weights should provide a better estimator when w*(X) and h(X)w*(X) are strongly 
correlated. In addition to these considerations, a major advantage to using the 
standardized weights is that it does not require knowing the proportionality constant 
for f. Hesterberg warns that using the standardized weights can be inferior to using 
the raw weights in many settings, especially when estimating small probabilities, and 
recommends consideration of an improved importance sampling strategy we describe 
below in Example 6.12 [326]. Casella and Robert also discuss a variety of uses of the 
importance weights [100]. 
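The sample-based checks mentioned above can be collected in one small function. The sketch
below plugs sample moments into the leading terms of (6.41) through (6.45); the function name
`is_diagnostics`, the use of the standardized estimate as the plug-in for $\mu$, and the
normal target with a wider normal envelope in the usage lines are all assumptions made for
illustration.

```python
import numpy as np

def is_diagnostics(h_vals, w_star):
    """Plug-in versions of the leading terms in (6.41)-(6.45).

    h_vals : values h(X_i) for draws X_i from the envelope g
    w_star : unstandardized weights f(X_i) / g(X_i)
    """
    h_vals = np.asarray(h_vals, dtype=float)
    w_star = np.asarray(w_star, dtype=float)
    n = len(w_star)
    t = h_vals * w_star
    mu_hat = t.sum() / w_star.sum()            # standardized estimate used for mu
    var_t = np.var(t, ddof=1)
    var_w = np.var(w_star, ddof=1)
    cov_tw = np.cov(t, w_star)[0, 1]
    sd_prod = np.sqrt(var_t * var_w)
    return {
        "mu_hat": mu_hat,
        "bias_std": (mu_hat * var_w - cov_tw) / n,                               # (6.42)
        "var_unstd": var_t / n,                                                  # (6.41)
        "var_std": (var_t + mu_hat ** 2 * var_w - 2 * mu_hat * cov_tw) / n,      # (6.43)
        "mse_diff": (mu_hat ** 2 * var_w - 2 * mu_hat * cov_tw) / n,             # (6.44)
        "cor_tw": cov_tw / sd_prod if sd_prod > 0 else float("nan"),
        # Right side of (6.45), using E{w*} = 1 and mu_hat as the mean of t.
        "threshold": np.sqrt(var_w) * mu_hat / (2 * np.sqrt(var_t)) if var_t > 0 else float("nan"),
    }

# Usage: target N(0, 1), envelope N(0, 1.5^2), h(x) = x^2, so mu = 1.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.5, size=50_000)
w_star = 1.5 * np.exp(-0.5 * x ** 2 * (1 - 1 / 2.25))
diag = is_diagnostics(x ** 2, w_star)
```

A negative `mse_diff`, or equivalently `cor_tw` exceeding `threshold`, points toward the
standardized weights for the problem at hand.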

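Using the standardized weights is reminiscent of the SIR algorithm (Section 6.3.1),
and it is sensible to compare the estimation properties of $\hat{\mu}_{IS}$ with those
of the sample mean of the SIR draws. Suppose that an initial sample $Y_1, \ldots, Y_m$
with corresponding weights $w(Y_1), \ldots, w(Y_m)$ is resampled to provide $n$ SIR draws
$X_1, \ldots, X_n$, where $n < m$. Let

$$\hat{\mu}_{SIR} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$$

denote the SIR estimate of $\mu$.

When interest is limited to estimation of $\mu$, the importance sampling estimator
$\hat{\mu}_{IS}$ ordinarily should be preferred over $\hat{\mu}_{SIR}$. To see this, note

$$E\{\hat{\mu}_{SIR}\} = E\{h(X_i)\} = E\left\{E\{h(X_i) \mid Y_1, \ldots, Y_m\}\right\} = E\left\{\frac{\sum_{i=1}^{m} h(Y_i)\, w^*(Y_i)}{\sum_{i=1}^{m} w^*(Y_i)}\right\} = E\{\hat{\mu}_{IS}\}.$$

Therefore the SIR estimator has the same bias as $\hat{\mu}_{IS}$. However, the variance of
$\hat{\mu}_{SIR}$ is

$$\begin{aligned}
\mathrm{var}\{\hat{\mu}_{SIR}\} &= E\left\{\mathrm{var}\{\hat{\mu}_{SIR} \mid Y_1, \ldots, Y_m\}\right\} + \mathrm{var}\left\{E\{\hat{\mu}_{SIR} \mid Y_1, \ldots, Y_m\}\right\} \\
&= E\left\{\mathrm{var}\{\hat{\mu}_{SIR} \mid Y_1, \ldots, Y_m\}\right\} + \mathrm{var}\left\{\frac{\sum_{i=1}^{m} h(Y_i)\, w^*(Y_i)}{\sum_{i=1}^{m} w^*(Y_i)}\right\} \\
&\ge \mathrm{var}\{\hat{\mu}_{IS}\}.
\end{aligned} \tag{6.46}$$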


Thus the SIR estimator provides convenience at the expense of precision. 
FIGURE 6.10 Network connecting A and B described in Example 6.9.


An attractive feature of any importance sampling method is the possibility of 
reusing the simulations. The same sampled points and weights can be used to compute 
a variety of Monte Carlo integral estimates of different quantities. The weights can be 
changed to reflect an alternative importance sampling envelope, to assess or improve 
performance of the estimator itself. The weights can also be changed to reflect an 
alternative target distribution, thereby estimating the expectation of h(X) with respect 
to a different density. 

For example, in a Bayesian analysis, one can efficiently update estimates based 
on a revised posterior distribution in order to carry out Bayesian sensitivity analysis or 
sequentially to update previous results via Bayes’ theorem in light of new information. 
Such updates can be carried out by multiplying each existing weight w(X;) by an 
adjustment factor. For example, if $f$ is a posterior distribution for $X$ using prior $p_1$,
then weights equal to $w(X_i)\, p_2(X_i)/p_1(X_i)$ for $i = 1, \ldots, n$ can be used with the
existing sample to provide inference from the posterior distribution using prior $p_2$.
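A toy conjugate model makes this reweighting easy to verify. In the sketch below, a single
hypothetical observation $y = 1$ has a $N(\theta, 1)$ likelihood, the original prior is
$N(0, 1)$, and the revised prior is $N(0, 2^2)$, so the exact posterior means are 0.5 and 0.8.
The model, the $N(0, 3^2)$ envelope, and all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y = 1.0                                    # hypothetical observation, y | theta ~ N(theta, 1)

def norm_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

theta = rng.normal(0.0, 3.0, size=n)       # draws from the envelope g = N(0, 3^2)
g = norm_pdf(theta, 0.0, 3.0)
lik = norm_pdf(y, theta, 1.0)
p1 = norm_pdf(theta, 0.0, 1.0)             # original prior
p2 = norm_pdf(theta, 0.0, 2.0)             # revised prior

# Standardized weights for the posterior under prior p1.
w1 = p1 * lik / g
w1 /= w1.sum()
post_mean_p1 = np.sum(w1 * theta)

# Reuse the same draws: multiply by the adjustment factor p2 / p1 and restandardize.
w2 = w1 * p2 / p1
w2 /= w2.sum()
post_mean_p2 = np.sum(w2 * theta)
```

No new draws are generated for the second posterior; only the weights change.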


Example 6.9 (Network Failure Probability) Many systems can be represented
by connected graphs like Figure 6.10. These graphs are composed of nodes (circles) 
and edges (line segments). A signal sent from A to B must follow a path along any 
available edges. Imperfect network reliability means that the signal may fail to be 
transmitted correctly between any pair of connected nodes—in other words, some 
edges may be broken. In order for the signal to successfully reach B, a connected path 
from A to B must exist. For example, Figure 6.11 shows a degraded network where 
only a few routes remain from A to B. If the lowest horizontal edge in this figure were 
broken, the network would fail. 

Network graphs can be used to model many systems. Naturally, such a network 
can model transmission of diverse types of signals such as analog voice transmission, 
electromagnetic digital signals, and optical transmission of digital data. The model 
may also be more conceptual, with each edge representing different machines or 
people whose participation may be needed to achieve some outcome. Usually, an 
important quantity of interest is the probability of network failure given specific 
probabilities for the failure of each edge. 
FIGURE 6.11 Network connecting A and B described in Example 6.9, with some edges
broken. 


Consider the simplest case, where each edge is assumed to fail independently 
with the same probability, p. In many applications p is quite small. Bit error rates for 
many types of signal transmission can range from $10^{-10}$ to $10^{-3}$ or even lower [608].

Let $X$ denote a network, summarizing random outcomes for each edge: intact
or failed. The network considered in our example has 20 potential edges, so $X =
(X_1, \ldots, X_{20})$. Let $b(X)$ denote the number of broken edges in $X$. The network in
Figure 6.10 has $b(X) = 0$; the network in Figure 6.11 has $b(X) = 10$. Let $h(X)$ indicate
network failure, so $h(X) = 1$ if A is not connected to B, and $h(X) = 0$ if A and B are
connected. The probability of network failure, then, is $\mu = E\{h(X)\}$. Computing $\mu$
for a network of any realistic size can be a very difficult combinatorial problem.

The naive Monte Carlo estimate of $\mu$ is obtained by drawing $X_1, \ldots, X_n$ independently
and uniformly at random from the set of all possible network configurations
whose edges fail independently with probability $p$. The estimator is computed as

$$\hat{\mu}_{MC} = \frac{1}{n} \sum_{i=1}^{n} h(X_i). \tag{6.47}$$

Notice that this estimator has variance $\mu(1 - \mu)/n$. For $n = 100{,}000$ and $p = 0.05$,
simulation yields $\hat{\mu}_{MC} = 2.00 \times 10^{-5}$ with a Monte Carlo standard error of about
$1.41 \times 10^{-5}$.

The problem with $\hat{\mu}_{MC}$ is that $h(X)$ is very rarely 1 unless $p$ is unrealistically
large. Thus, a huge number of networks may need to be simulated in order to estimate
$\mu$ with sufficient precision. Instead, we can use importance sampling to focus on
simulation of X for which h(X) = 1, compensating for this bias through the assign- 
ment of importance weights. The calculations that follow adopt this strategy, using 
the nonstandardized importance weights as in (6.38). 

Suppose we simulate $X_1^*, \ldots, X_n^*$ by generating network configurations formed
by breaking edges in Figure 6.10, assuming independent edge failure with probability
$p^* > p$. The importance weight for $X_i^*$ can be written as

$$w^*(X_i^*) = \left(\frac{1 - p}{1 - p^*}\right)^{20} \left(\frac{p\,(1 - p^*)}{p^*(1 - p)}\right)^{b(X_i^*)}, \tag{6.48}$$

and the importance sampling estimator of $\mu$ is

$$\hat{\mu}^*_{IS} = \frac{1}{n} \sum_{i=1}^{n} h(X_i^*)\, w^*(X_i^*). \tag{6.49}$$

Let $C$ denote the set of all possible network configurations, and let $F$ denote the
subset of configurations for which A and B are not connected. Then

\begin{align}
\mathrm{var}\{\hat{\mu}^*_{IS}\} &= \frac{1}{n}\, \mathrm{var}\{h(X_i^*)\, w^*(X_i^*)\} \tag{6.50} \\
&= \frac{1}{n} \left( E\left\{[h(X_i^*)\, w^*(X_i^*)]^2\right\} - \left[E\{h(X_i^*)\, w^*(X_i^*)\}\right]^2 \right) \tag{6.51} \\
&= \frac{1}{n} \left( \sum_{x \in F} w^*(x)\, p^{b(x)} (1 - p)^{20 - b(x)} - \mu^2 \right). \tag{6.52}
\end{align}


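Now, for a network derived from Figure 6.10, failure only occurs when $b(X) \ge 4$.
Therefore,

$$w^*(X) \le \left(\frac{1 - p}{1 - p^*}\right)^{20} \left(\frac{p\,(1 - p^*)}{p^*(1 - p)}\right)^{4}. \tag{6.53}$$

When $p^* = 0.25$ and $p = 0.05$, we find $w^*(X) \le 0.07$. In this case,

\begin{align}
\mathrm{var}\{\hat{\mu}^*_{IS}\} &\le \frac{1}{n} \left[ 0.07 \sum_{x \in F} p^{b(x)} (1 - p)^{20 - b(x)} - \mu^2 \right] \tag{6.54} \\
&= \frac{1}{n} \left[ 0.07 \sum_{x \in C} h(x)\, p^{b(x)} (1 - p)^{20 - b(x)} - \mu^2 \right] \tag{6.55} \\
&= \frac{0.07\mu - \mu^2}{n}. \tag{6.56}
\end{align}

Thus $\mathrm{var}\{\hat{\mu}^*_{IS}\}$ is substantially smaller than $\mathrm{var}\{\hat{\mu}_{MC}\}$. Under the approximation
that $c\mu - \mu^2 \approx c\mu$ for small $\mu$ and relatively larger $c$, we see that
$\mathrm{var}\{\hat{\mu}_{MC}\}/\mathrm{var}\{\hat{\mu}^*_{IS}\} \approx 14$.

With the naive simulation strategy using $p = 0.05$, only 2 of 100,000 simulated
networks failed. However, using the importance sampling strategy with $p^* = 0.2$
yielded 491 failing networks, producing an estimate of $\hat{\mu}^*_{IS} = 1.02 \times 10^{-5}$ with a
Monte Carlo standard error of $1.57 \times 10^{-6}$.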

Related Monte Carlo variance reduction techniques for network reliability prob- 
lems are discussed in [432]. 
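The weighting scheme in (6.48) and (6.49) can be tried on a network small enough to solve
exactly. The sketch below uses a hypothetical five-edge "bridge" network, not the 20-edge
network of Figure 6.10, so that the importance sampling estimate can be compared with the
failure probability obtained by enumerating all 32 configurations. The edge list, node
labels, and sample size are assumptions of this example.

```python
import itertools

import numpy as np

# Hypothetical bridge network. Nodes: 0 = A, 1 and 2 = intermediate, 3 = B.
EDGES = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
M = len(EDGES)

def fails(broken):
    """h(x): 1 if A and B are disconnected, given 0/1 broken flags per edge."""
    reached, frontier = {0}, [0]
    while frontier:
        u = frontier.pop()
        for (a, b), is_broken in zip(EDGES, broken):
            if not is_broken and u in (a, b):
                v = b if u == a else a
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
    return int(3 not in reached)

# Enumerate every configuration once; the first edge is the most significant bit.
configs = list(itertools.product([0, 1], repeat=M))
h_table = np.array([fails(c) for c in configs])
b_table = np.array([sum(c) for c in configs])

p, p_star, n = 0.05, 0.25, 50_000
exact = np.sum(h_table * p ** b_table * (1 - p) ** (M - b_table))

rng = np.random.default_rng(0)
broken = rng.random((n, M)) < p_star                     # edge failures under p*
code = broken.astype(int) @ (2 ** np.arange(M - 1, -1, -1))
b = broken.sum(axis=1)
# Weight from (6.48) with the exponent 20 replaced by the number of edges M.
w_star = ((1 - p) / (1 - p_star)) ** M * (p * (1 - p_star) / (p_star * (1 - p))) ** b
t = h_table[code] * w_star
mu_is = t.mean()                                         # (6.49)
se_is = t.std(ddof=1) / np.sqrt(n)
```

For a network too large to enumerate, the lookup table would be replaced by a call to a
connectivity check such as `fails` for each simulated configuration.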


6.4.2 Antithetic Sampling 


A second approach to variance reduction for Monte Carlo integration relies on finding 
two identically distributed unbiased estimators, say $\hat{\mu}_1$ and $\hat{\mu}_2$, that are negatively
correlated. Averaging these estimators will be superior to using either estimator alone
with double the sample size, since the estimator

$$\hat{\mu}_{AS} = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \tag{6.57}$$

has variance equal to

$$\mathrm{var}\{\hat{\mu}_{AS}\} = \frac{1}{4}\left(\mathrm{var}\{\hat{\mu}_1\} + \mathrm{var}\{\hat{\mu}_2\}\right) + \frac{1}{2}\,\mathrm{cov}\{\hat{\mu}_1, \hat{\mu}_2\} = \frac{(1 + \rho)\,\sigma^2}{2n}, \tag{6.58}$$

where $\rho$ is the correlation between the two estimators and $\sigma^2/n$ is the variance of
either estimator using a sample of size $n$. Such pairs of estimators can be generated
using the antithetic sampling approach [304, 555]. 

Given an initial estimator, $\hat{\mu}_1$, the question is how to construct a second, identically
distributed estimator $\hat{\mu}_2$ that is negatively correlated with $\hat{\mu}_1$. In many situations,
there is a convenient way to create such estimators while reusing one simulation sample
of size $n$ rather than generating a second sample from scratch. To describe the
strategy, we must first introduce some notation. Let $\mathbf{X}$ denote a set of i.i.d. random
variables, $\{X_1, \ldots, X_n\}$. Suppose $\hat{\mu}_1(\mathbf{X}) = \sum_{i=1}^{n} h_1(X_i)/n$, where $h_1$ is a real-valued
function of $m$ arguments, so $h_1(X_i) = h_1(X_{i1}, \ldots, X_{im})$. Assume $E\{h_1(X_i)\} = \mu$.
Let $\hat{\mu}_2(\mathbf{X}) = \sum_{i=1}^{n} h_2(X_i)/n$ be a second estimator, with the analogous assumptions
about $h_2$.

We will now prove that if $h_1$ and $h_2$ are both increasing in each argument (or
both decreasing), then $\mathrm{cov}\{h_1(X_i), h_2(X_i)\}$ is positive. From this result, we will be
able to determine requirements for $h_1$ and $h_2$ that ensure that $\mathrm{cor}\{\hat{\mu}_1, \hat{\mu}_2\}$ is negative.

The proof proceeds via induction. Suppose the above hypotheses hold and 
m = 1. Then 


$$[h_1(X) - h_1(Y)]\,[h_2(X) - h_2(Y)] \ge 0 \tag{6.59}$$

for any random variables $X$ and $Y$. Hence, the expectation of the left-hand side of
(6.59) is also nonnegative. Therefore, when $X$ and $Y$ are independent and identically
distributed, this nonnegative expectation, which equals $2\,\mathrm{cov}\{h_1(X), h_2(X)\}$, implies

$$\mathrm{cov}\{h_1(X_i), h_2(X_i)\} \ge 0. \tag{6.60}$$

Now, suppose that the desired result holds when $X_i$ is a random vector of length
$m - 1$, and consider the case when $X_i = (X_{i1}, \ldots, X_{im})$. Then, by hypothesis, the
random variable

$$\mathrm{cov}\{h_1(X_i), h_2(X_i) \mid X_{im}\} \ge 0. \tag{6.61}$$

Taking the expectation of this inequality gives

$$\begin{aligned}
0 &\le E\left\{E\{h_1(X_i)\, h_2(X_i) \mid X_{im}\}\right\} - E\left\{E\{h_1(X_i) \mid X_{im}\}\, E\{h_2(X_i) \mid X_{im}\}\right\} \\
&\le E\{h_1(X_i)\, h_2(X_i)\} - E\left\{E\{h_1(X_i) \mid X_{im}\}\right\}\, E\left\{E\{h_2(X_i) \mid X_{im}\}\right\} \\
&= \mathrm{cov}\{h_1(X_i), h_2(X_i)\},
\end{aligned} \tag{6.62}$$
where the substitution of terms in the product on the right side of (6.62) follows from
the fact that each $E\{h_j(X_i) \mid X_{im}\}$ for $j = 1, 2$ is a function of the single random
argument $X_{im}$, for which the result (6.60) applies.

Thus, we have proven by induction that $h_1(X_i)$ and $h_2(X_i)$ will be positively
correlated in these circumstances; it follows that $\hat{\mu}_1$ and $\hat{\mu}_2$ will also be positively
correlated. We leave it to the reader to show the following key implication: If $h_1$ and
$h_2$ are functions of $m$ random variables $U_1, \ldots, U_m$, and if each function is monotone
in each argument, then $\mathrm{cov}\{h_1(U_1, \ldots, U_m),\, h_2(1 - U_1, \ldots, 1 - U_m)\} \le 0$.
This result follows simply from our previous proof after redefining $h_1$ and $h_2$ to create
two functions increasing in their arguments that satisfy the previous hypotheses. See
Problem 6.5.

Now the antithetic sampling strategy becomes apparent. The Monte Carlo integral
estimate $\hat{\mu}_1(\mathbf{X})$ can be written as

$$\hat{\mu}_1(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} h_1\!\left(F_1^{-1}(U_{i1}), \ldots, F_m^{-1}(U_{im})\right), \tag{6.63}$$

where $F_j$ is the cumulative distribution function of each $X_{ij}$ ($j = 1, \ldots, m$) and the
$U_{ij}$ are independent Unif(0, 1) random variables. Since $F_j$ is a cumulative distribution
function, its inverse is nondecreasing. Therefore, $h_1(F_1^{-1}(U_{i1}), \ldots, F_m^{-1}(U_{im}))$
is monotone in each $U_{ij}$ for $j = 1, \ldots, m$ whenever $h_1$ is monotone in
its arguments. Moreover, if $U_{ij} \sim$ Unif(0, 1), then $1 - U_{ij} \sim$ Unif(0, 1). Hence,
$h_2(1 - U_{i1}, \ldots, 1 - U_{im}) = h_1(F_1^{-1}(1 - U_{i1}), \ldots, F_m^{-1}(1 - U_{im}))$ is monotone in
each argument and has the same distribution as $h_1(F_1^{-1}(U_{i1}), \ldots, F_m^{-1}(U_{im}))$.
Therefore,

$$\hat{\mu}_2(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} h_1\!\left(F_1^{-1}(1 - U_{i1}), \ldots, F_m^{-1}(1 - U_{im})\right) \tag{6.64}$$

is a second estimator of $\mu$ having the same distribution as $\hat{\mu}_1(\mathbf{X})$. Our analysis above
allows us to conclude that

$$\mathrm{cov}\{\hat{\mu}_1(\mathbf{X}), \hat{\mu}_2(\mathbf{X})\} \le 0. \tag{6.65}$$

Therefore, the estimator $\hat{\mu}_{AS} = (\hat{\mu}_1 + \hat{\mu}_2)/2$ will have smaller variance than $\hat{\mu}_1$
would have with a sample of size $2n$. Equation (6.58) quantifies the amount of improvement.
We accomplish this improvement while generating only a single set of $n$
random numbers, with the other $n$ derived from the antithetic principle.


Example 6.10 (Normal Expectation) Suppose $X$ has a standard normal distribution
and we wish to estimate $\mu = E\{h(X)\}$ where $h(x) = x/(2^x - 1)$. A standard
Monte Carlo estimator can be computed as the sample mean of $n = 100{,}000$ values
of $h(X_i)$ where $X_1, \ldots, X_n \sim$ i.i.d. $N(0, 1)$. An antithetic estimator can be constructed
using the first $n = 50{,}000$ draws. The antithetic variate for $X_i$ is simply
$-X_i$, so the antithetic estimator is $\hat{\mu}_{AS} = \sum_{i=1}^{50{,}000} [h(X_i) + h(-X_i)]/100{,}000$. In
the simulation, $\mathrm{cor}\{h(X_i), h(-X_i)\} = -0.95$, so the antithetic approach is profitable.
The standard approach yielded $\hat{\mu}_{MC} = 1.4993$ with a Monte Carlo standard error of
0.0016, whereas the antithetic approach gave $\hat{\mu}_{AS} = 1.4992$ with a standard error
of 0.0003 [estimated via (6.58) using the sample variance and correlation]. Further
simulation confirms a more than fourfold reduction in standard error for the antithetic
approach.
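A simulation along the lines of this example takes only a few lines of Python. The sketch
below is our own illustration with an arbitrary random seed, so its output will differ
slightly from the figures quoted above; the standard error of the antithetic estimator is
computed from (6.58).

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    return x / (2.0 ** x - 1.0)

n_pairs = 50_000
x = rng.standard_normal(n_pairs)
h_plus, h_minus = h(x), h(-x)            # each draw paired with its antithetic variate

mu_as = np.mean(0.5 * (h_plus + h_minus))
rho = np.corrcoef(h_plus, h_minus)[0, 1]
sigma2 = np.var(h_plus, ddof=1)
se_as = np.sqrt((1 + rho) * sigma2 / (2 * n_pairs))      # from (6.58)

# Simple Monte Carlo with the same total number of evaluations of h.
x_mc = rng.standard_normal(2 * n_pairs)
mu_mc = np.mean(h(x_mc))
se_mc = np.std(h(x_mc), ddof=1) / np.sqrt(2 * n_pairs)
```

Comparing `se_as` with `se_mc` shows the gain from the strongly negative correlation `rho`.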


Example 6.11 (Network Failure Probability, Continued) Recalling Example 6.9,
let the $i$th simulated network, $X_i$, be determined by standard uniform random variables
$U_{i1}, \ldots, U_{im}$, where $m = 20$. The $j$th edge in the $i$th simulated network is broken if
$U_{ij} < p$. Now $h(X_i) = h(U_{i1}, \ldots, U_{im})$ equals 1 if A and B are not connected, and
0 if they are connected. Note that $h$ is monotone (nonincreasing) in each $U_{ij}$, because
raising $U_{ij}$ can only repair an edge; therefore the antithetic approach will be profitable.
Since $X_i$ is obtained by breaking the $j$th edge when $U_{ij} < p$ for $j = 1, \ldots, m$, the
antithetic network draw, say $X_i^*$, is obtained by breaking the $j$th edge when
$U_{ij} > 1 - p$, for the same set of $U_{ij}$ used to generate $X_i$. The negative correlation
induced by this strategy will ensure that $\frac{1}{2n} \sum_{i=1}^{n} (h(X_i) + h(X_i^*))$
is a superior estimator to $\frac{1}{2n} \sum_{i=1}^{2n} h(X_i)$.


6.4.3 Control Variates 


The control variate strategy improves estimation of an unknown integral by relat- 
ing the estimate to some correlated estimator of an integral whose value is known. 
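Suppose we wish to estimate the unknown quantity $\mu = E\{h(X)\}$ and we know
of a related quantity, $\theta = E\{c(Y)\}$, whose value can be determined analytically.
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ denote pairs of random variables observed independently
as simulation outcomes, so $\mathrm{cov}\{X_i, X_j\} = \mathrm{cov}\{Y_i, Y_j\} = \mathrm{cov}\{X_i, Y_j\} = 0$
when $i \ne j$. The simple Monte Carlo estimators are $\hat{\mu}_{MC} = (1/n) \sum_{i=1}^{n} h(X_i)$ and
$\hat{\theta}_{MC} = (1/n) \sum_{i=1}^{n} c(Y_i)$. Of course, $\hat{\theta}_{MC}$ is unnecessary, since $\theta$ can be found analytically.
However, note that $\hat{\mu}_{MC}$ will be correlated with $\hat{\theta}_{MC}$ when $\mathrm{cor}\{h(X_i), c(Y_i)\} \ne 0$.
For example, if the correlation is positive, an unusually high outcome for $\hat{\theta}_{MC}$ should
tend to be associated with an unusually high outcome for $\hat{\mu}_{MC}$. If comparison of $\hat{\theta}_{MC}$
with $\theta$ suggests such an outcome, then we should adjust $\hat{\mu}_{MC}$ downward accordingly.
The opposite adjustment should be made when the correlation is negative.
This reasoning suggests the control variate estimator

$$\hat{\mu}_{CV} = \hat{\mu}_{MC} + \lambda\,(\hat{\theta}_{MC} - \theta), \tag{6.66}$$

where $\lambda$ is a parameter to be chosen by the user. It is straightforward to show that

$$\mathrm{var}\{\hat{\mu}_{CV}\} = \mathrm{var}\{\hat{\mu}_{MC}\} + \lambda^2\, \mathrm{var}\{\hat{\theta}_{MC}\} + 2\lambda\, \mathrm{cov}\{\hat{\mu}_{MC}, \hat{\theta}_{MC}\}. \tag{6.67}$$

Minimizing this quantity with respect to $\lambda$ shows that the minimal variance,

$$\min_{\lambda}\left(\mathrm{var}\{\hat{\mu}_{CV}\}\right) = \mathrm{var}\{\hat{\mu}_{MC}\} - \frac{\left(\mathrm{cov}\{\hat{\mu}_{MC}, \hat{\theta}_{MC}\}\right)^2}{\mathrm{var}\{\hat{\theta}_{MC}\}}, \tag{6.68}$$

is obtained when

$$\lambda = \frac{-\mathrm{cov}\{\hat{\mu}_{MC}, \hat{\theta}_{MC}\}}{\mathrm{var}\{\hat{\theta}_{MC}\}}. \tag{6.69}$$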
This optimal $\lambda$ depends on unknown moments of $h(X_i)$ and $c(Y_i)$, but these can be
estimated using the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Specifically, using

$$\widehat{\mathrm{var}}\{\hat{\theta}_{MC}\} = \sum_{i=1}^{n} \frac{[c(Y_i) - \bar{c}]^2}{n(n - 1)} \tag{6.70}$$

and

$$\widehat{\mathrm{cov}}\{\hat{\mu}_{MC}, \hat{\theta}_{MC}\} = \sum_{i=1}^{n} \frac{[h(X_i) - \bar{h}]\,[c(Y_i) - \bar{c}]}{n(n - 1)} \tag{6.71}$$

in (6.69) provides an estimator $\hat{\lambda}$, where $\bar{c} = (1/n) \sum_{i=1}^{n} c(Y_i)$ and $\bar{h} =
(1/n) \sum_{i=1}^{n} h(X_i)$. Further, plugging such sample variance and covariance estimates
into the right-hand side of (6.68) provides a variance estimate for $\hat{\mu}_{CV}$.

In practice, $\hat\mu_{MC}$ and $\hat\theta_{MC}$ often depend on the same random variables, so $X_i = Y_i$.
Also, it is possible to use more than one control variate. In this case, we may write
the estimator as $\hat\mu_{CV} = \hat\mu_{MC} + \sum_{j=1}^{m} \lambda_j(\hat\theta_{MC,j} - \theta_j)$ when using $m$ control variates.

Equation (6.68) shows that the proportional reduction in variance obtained by
using $\hat\mu_{CV}$ instead of $\hat\mu_{MC}$ is equal to the square of the correlation between $\hat\mu_{MC}$ and
$\hat\theta_{MC}$. If this result sounds familiar, you have astutely noted a parallel with simple
linear regression. Consider the regression model $E\{h(X_i) \mid Y_i = y_i\} = \beta_0 + \beta_1 c(y_i)$
with the usual regression assumptions and estimators. Then $\hat\lambda = -\hat\beta_1$ and
$\hat\mu_{MC} + \hat\lambda(\hat\theta_{MC} - \theta) = \hat\beta_0 + \hat\beta_1\theta$. In other words, the control variate estimator is the fitted
value on the regression line at the mean value of the predictor (i.e., at $\theta$), and the
standard error of this control variate estimator is the standard error for the fitted
value from the regression. Thus, linear regression software may be used to obtain
the control variate estimator and a corresponding confidence interval. When more
than one control variate is used, multiple linear regression can be used to obtain $\hat\lambda_i$
($i = 1, \ldots, m$) and $\hat\mu_{CV}$ [555].
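The regression connection can be checked numerically. The sketch below fits the least-squares line by hand and compares the fitted value at $\theta$ with the direct control variate formula. The helper names and the toy data (the same $e^U$ illustration, which is an assumption of this sketch rather than an example from the text) are hypothetical.

```python
import math
import random


def ols_fit(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cv_via_regression(h_vals, c_vals, theta):
    """Control variate estimate as the fitted value at c = theta."""
    b0, b1 = ols_fit(c_vals, h_vals)
    return b0 + b1 * theta, -b1          # (mu_cv, lambda_hat)


def cv_direct(h_vals, c_vals, theta):
    """Control variate estimate from (6.66) with lambda from (6.69)-(6.71)."""
    n = len(h_vals)
    h_bar = sum(h_vals) / n
    c_bar = sum(c_vals) / n
    cov = sum((h - h_bar) * (c - c_bar) for h, c in zip(h_vals, c_vals))
    var = sum((c - c_bar) ** 2 for c in c_vals)
    lam = -cov / var                     # the common factor 1/(n(n-1)) cancels
    return h_bar + lam * (c_bar - theta), lam


def toy_data(n=500, seed=0):
    """Hypothetical data: h(U) = exp(U), c(U) = U, U ~ Unif(0,1), theta = 1/2."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    return [math.exp(x) for x in u], u
```

Both routes give the same estimate and the same $\hat\lambda$ up to rounding error, because $\hat\beta_0 + \hat\beta_1\theta = \bar h - \hat\beta_1(\bar c - \theta)$.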

Problem 6.5 asks you to show that the antithetic approach to variance reduction 
can be viewed as a special case of the control variate method. 


Example 6.12 (Control Variate for Importance Sampling) Hesterberg suggests
using a control variate estimator to improve importance sampling [326]. Recall that
importance sampling is built upon the idea of sampling from an envelope that induces
a correlation between $h(X)w^*(X)$ and $w^*(X)$. Further, we know $E\{w^*(X)\} = 1$. Thus,
the situation is well suited for using the control variate $\bar w^* = \sum_{i=1}^{n} w^*(X_i)/n$. If the
average weight exceeds 1, then the average value of $h(X)w^*(X)$ is also probably
unusually high, in which case $\hat\mu^*_{IS}$ probably differs from its expectation, $\mu$. Thus, the
importance sampling control variate estimator is

$$\hat\mu_{ISCV} = \hat\mu^*_{IS} + \lambda(\bar w^* - 1). \tag{6.72}$$

The value for $\lambda$ and the standard error of $\hat\mu_{ISCV}$ can be estimated from a regression of
$h(X)w^*(X)$ on $w^*(X)$ as previously described. Like $\hat\mu_{IS}$, which uses the standardized
weights, the estimator $\hat\mu_{ISCV}$ has bias of order $O(1/n)$, but will often have lower mean
squared error than the importance sampling estimator with unstandardized weights
$\hat\mu^*_{IS}$ given in (6.38).
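A minimal sketch of the estimator in (6.72) follows. It assumes a standard normal target, a wider normal envelope, and $h(x) = x^2$ (so $\mu = 1$); these choices and the function name are illustrative assumptions, not part of the example above. The value of $\lambda$ is taken as minus the slope of the regression of $h(X)w^*(X)$ on $w^*(X)$.

```python
import random
from statistics import NormalDist, fmean


def is_control_variate(h, target, envelope, n=20_000, seed=2):
    """Return the unstandardized, standardized and control variate IS estimates."""
    rng = random.Random(seed)
    x = [rng.gauss(envelope.mean, envelope.stdev) for _ in range(n)]
    w = [target.pdf(xi) / envelope.pdf(xi) for xi in x]   # w*(X_i)
    hw = [h(xi) * wi for xi, wi in zip(x, w)]             # h(X_i) w*(X_i)
    w_bar, hw_bar = fmean(w), fmean(hw)
    # Regression of h(X)w*(X) on w*(X): lambda_hat is minus the slope.
    sxx = sum((wi - w_bar) ** 2 for wi in w)
    sxy = sum((wi - w_bar) * (yi - hw_bar) for wi, yi in zip(w, hw))
    lam = -sxy / sxx
    mu_unstd = hw_bar                        # unstandardized weights
    mu_std = sum(hw) / sum(w)                # standardized weights
    mu_iscv = hw_bar + lam * (w_bar - 1.0)   # (6.72)
    return mu_unstd, mu_std, mu_iscv
```

For instance, `is_control_variate(lambda x: x * x, NormalDist(0, 1), NormalDist(0, 2))` returns three estimates of $E\{X^2\} = 1$ computed from the same draws.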


Example 6.13 (Option Pricing) A call option is a financial instrument that gives 
its holder the right—but not the obligation—to buy a specified amount of a financial 
asset, on or by a specified maturity date, for a specified price. For a European call 
option, the option can be exercised only at the maturity date. The strike price is the 
price at which the transaction is completed if the option is exercised. Let $S^{(t)}$ denote
the price of the underlying financial asset (say, a stock) at time $t$. Denote the strike
price by $K$, and let $T$ denote the maturity date. When time $T$ arrives, the holder of
the call option will not wish to exercise his option if $K > S^{(T)}$ because he can obtain
the stock more cheaply on the open market. However, the option will be valuable if
$K < S^{(T)}$ because he can buy the stock at the low price $K$ and immediately sell it at
the higher market price $S^{(T)}$. It is of interest to determine how much the buyer of this
call option should pay at time t = 0 for this option with strike price K at maturity 
date T. 

The Nobel Prize—winning model introduced by Black, Scholes, and Merton in 
1973 provides a popular approach to determining the fair price of an option using a 
stochastic differential equation [52, 459]. Further background on option pricing and 
the stochastic calculus of finance includes [184, 406, 586, 665]. 

The fair price of an option is the amount to be paid at time $t = 0$ that would
exactly counterbalance the expected payoff at maturity. We’ll consider the simplest 
case: a European call option on a stock that pays no dividends. The fair price of 
such an option can be determined analytically under the Black-Scholes model, but 
estimation of the fair price via Monte Carlo is an instructive starting point. According 
to the Black-Scholes model, the value of the stock at day $T$ can be simulated as

$$S^{(T)} = S^{(0)} \exp\left\{\left(r - \frac{\sigma^2}{2}\right)\frac{T}{365} + \sigma Z \sqrt{\frac{T}{365}}\right\}, \tag{6.73}$$

where $r$ is the risk-free rate of return (typically the return rate of the U.S. Treasury
bill that matures on day $T - 1$), $\sigma$ is the stock's volatility [an annualized estimate
of the standard deviation of $\log\{S^{(t+1)}/S^{(t)}\}$ under a lognormal price model], and $Z$
is a standard normal deviate. If we knew that the price of the stock at day $T$ would
equal $S^{(T)}$, then the fair price of the call option would be

$$C = \exp\left\{-\frac{rT}{365}\right\} \max\left\{0,\, S^{(T)} - K\right\}, \tag{6.74}$$

discounting the payoff to present value. Since $S^{(T)}$ is unknown to the buyer of the call,
the fair price to pay at $t = 0$ is the expected value of the discounted payoff, namely
$E\{C\}$. Thus, a Monte Carlo estimate of the fair price to pay at $t = 0$ is

$$\bar C = \frac{1}{n}\sum_{i=1}^{n} C_i, \tag{6.75}$$

where the $C_i$ are simulated from (6.73) and (6.74) for $i = 1, \ldots, n$ using an i.i.d.
sample of standard normal deviates, $Z_1, \ldots, Z_n$.
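The estimator (6.75) is short enough to sketch directly. The function names below are hypothetical, and the closed-form comparison uses the standard Black-Scholes call formula, which is quoted here as a known result rather than taken from the text above; time is measured in days with a 365-day year, matching (6.73).

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def european_call_mc(s0, k, sigma, t_days, r, n=50_000, seed=3):
    """Monte Carlo estimate (6.75) of the fair price, with its standard error."""
    rng = random.Random(seed)
    tau = t_days / 365.0
    disc = math.exp(-r * tau)
    payoffs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        # Stock price at maturity, (6.73).
        s_t = s0 * math.exp((r - sigma ** 2 / 2) * tau + sigma * z * math.sqrt(tau))
        # Discounted payoff, (6.74).
        payoffs.append(disc * max(0.0, s_t - k))
    return fmean(payoffs), stdev(payoffs) / math.sqrt(n)


def black_scholes_call(s0, k, sigma, t_days, r):
    """Standard Black-Scholes closed form for a European call (known result)."""
    tau = t_days / 365.0
    d1 = (math.log(s0 / k) + (r + sigma ** 2 / 2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    phi = NormalDist().cdf
    return s0 * phi(d1) - k * math.exp(-r * tau) * phi(d2)
```

Comparing `european_call_mc(100, 102, 0.3, 50, 0.05)` with `black_scholes_call(100, 102, 0.3, 50, 0.05)` gives a quick check that the simulation is coded correctly before moving to options with no analytic price.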

Since the true fair price, E{C}, can be computed analytically in this instance, 
there is no need to apply the Monte Carlo approach. However, a special type of 
European call option, called an Asian, path-dependent, or average-price option, has 
a payoff based on the average price of the underlying stock throughout the holding 
period. Such options are attractive for consumers of energies and commodities because 
they tend to be exposed to average prices over time. Since the averaging process 
reduces volatility, Asian options also tend to be cheaper than standard options. Control 
variate and many other variance reduction approaches for the Monte Carlo pricing of 
options like these are examined in [59]. 

To simulate the fair price of an Asian call option, simulation of stock value at
maturity is carried out by applying (6.73) sequentially $T$ times, each time advancing
the stock price one day and recording the simulated closing price for that day, so

$$S^{(t+1)} = S^{(t)} \exp\left\{\frac{r - \sigma^2/2}{365} + \frac{\sigma Z^{(t)}}{\sqrt{365}}\right\} \tag{6.76}$$

for a sequence of standard normal deviates, $\{Z^{(t)}\}$, where $t = 0, \ldots, T - 1$. The
discounted payoff at day $T$ of the Asian call option on a stock with current price $S^{(0)}$ is
defined as

$$A = \exp\left\{-\frac{rT}{365}\right\} \max\left\{0,\, \bar S - K\right\}, \tag{6.77}$$

where $\bar S = \sum_{t=1}^{T} S^{(t)}/T$ and the $S^{(t)}$ for $t = 1, \ldots, T$ are the random variables
representing future stock prices at the averaging times. The fair price to pay at $t = 0$ is
$E\{A\}$, but in this case there is no known analytic solution for it. Denote the standard
Monte Carlo estimator for the fair price of an Asian call option as

$$\hat\mu_{MC} = \bar A = \frac{1}{n}\sum_{i=1}^{n} A_i, \tag{6.78}$$

where the $A_i$ are simulated independently as described above.

If $\bar S$ is replaced in (6.77) by the geometric average of the price of the underlying
stock throughout the holding period, an analytic solution for $E\{A\}$ can be found [370].
The fair price is then

$$\theta = S^{(0)}\,\Phi(c_1) \exp\left\{-T\left(r + \frac{c_3\sigma^2}{6}\right)\frac{1 - 1/N}{730}\right\} - K\,\Phi(c_1 - c_2) \exp\left\{-\frac{rT}{365}\right\}, \tag{6.79}$$

where

$$c_1 = \frac{1}{c_2}\left[\log\left(\frac{S^{(0)}}{K}\right) + \frac{c_3 T}{730}\left(r - \frac{\sigma^2}{2}\right) + \frac{c_3\sigma^2 T}{1095}\left(1 + \frac{1}{2N}\right)\right],$$

$$c_2 = \sigma\left[\frac{c_3 T}{1095}\left(1 + \frac{1}{2N}\right)\right]^{1/2},$$

$$c_3 = 1 + 1/N,$$

where $\Phi$ is the standard normal cumulative distribution function, and $N$ is the number
of prices in the average. Alternatively, one could estimate the fair price of an Asian
call option with geometric averaging using the same sort of Monte Carlo strategy
described above. Denote this Monte Carlo estimator as $\hat\theta_{MC}$.

The estimator $\hat\theta_{MC}$ makes an excellent control variate for estimation of $\mu$. Let
$\hat\mu_{CV} = \hat\mu_{MC} + \lambda(\hat\theta_{MC} - \theta)$. Since we expect the fair price of the two Asian options
(arithmetic and geometric mean pricing) to be very highly correlated, a reasonable
initial guess is to use $\lambda = -1$.

Consider a European call option with payoff based on the arithmetic mean
price during the holding period. Suppose that the underlying stock has current price
$S^{(0)} = 100$, strike price $K = 102$, and volatility $\sigma = 0.3$. Suppose there are $N = 50$
days to maturity, so simulation of the maturity price requires 50 iterations of (6.76).
Assume the risk-free rate of return is $r = 0.05$. Then the analogous geometric mean
price option has a fair price of 1.83. Simulations show that the true fair value of the
arithmetic mean price option is roughly $\mu = 1.876$. Using $n = 100{,}000$ simulations,
we can estimate $\mu$ using either $\hat\mu_{MC}$ or $\hat\mu_{CV}$, and both estimators tend to give answers
in the vicinity of $\mu$. But what is of interest is the standard error of estimates of $\mu$. We
replicated the entire Monte Carlo estimation process 100 times, obtaining 100 values
for $\hat\mu_{MC}$ and for $\hat\mu_{CV}$. The sample standard deviation of the values obtained for $\hat\mu_{MC}$
was 0.0107, whereas that of the $\hat\mu_{CV}$ values was 0.000295. Thus, the control variate
approach provided an estimator with 36 times smaller standard error.

Finally, consider estimating $\lambda$ from the simulations using (6.69). Repeating
the same experiment as above, the typical correlation between $\hat\mu_{MC}$ and $\hat\theta_{MC}$ was
0.9999. The mean of $\hat\lambda$ was $-1.0217$ with sample standard deviation 0.0001. Using
the $\hat\lambda$ found in each simulation to produce each $\hat\mu_{CV}$ yielded a set of 100 $\hat\mu_{CV}$ values
whose standard deviation was 0.000168. This represents a 63-fold improvement in
the standard error over $\hat\mu_{MC}$.
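The computation in this example can be sketched as follows. The code is an illustration rather than the authors' implementation: the function names are hypothetical, the geometric average is taken over the same daily prices $S^{(1)}, \ldots, S^{(T)}$ as the arithmetic average (so $N = T$ is assumed), and a smaller number of paths is used than in the text to keep the run short.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def geometric_asian_price(s0, k, sigma, t_days, r, n_avg):
    """Closed-form fair price (6.79) under geometric averaging."""
    c3 = 1.0 + 1.0 / n_avg
    c2 = sigma * math.sqrt(c3 * t_days / 1095.0 * (1.0 + 1.0 / (2 * n_avg)))
    c1 = (math.log(s0 / k)
          + c3 * t_days / 730.0 * (r - sigma ** 2 / 2)
          + c3 * sigma ** 2 * t_days / 1095.0 * (1.0 + 1.0 / (2 * n_avg))) / c2
    phi = NormalDist().cdf
    first = math.exp(-t_days * (r + c3 * sigma ** 2 / 6) * (1 - 1.0 / n_avg) / 730.0)
    return s0 * phi(c1) * first - k * phi(c1 - c2) * math.exp(-r * t_days / 365.0)


def simulate_asian_payoffs(s0, k, sigma, t_days, r, n, seed=4):
    """Return paired discounted payoffs (arithmetic, geometric) for n paths."""
    rng = random.Random(seed)
    drift = (r - sigma ** 2 / 2) / 365.0
    vol = sigma / math.sqrt(365.0)
    disc = math.exp(-r * t_days / 365.0)
    arith, geom = [], []
    for _path in range(n):
        log_s, sum_s, sum_log = math.log(s0), 0.0, 0.0
        for _day in range(t_days):
            log_s += drift + vol * rng.gauss(0.0, 1.0)   # one step of (6.76)
            sum_s += math.exp(log_s)
            sum_log += log_s
        arith.append(disc * max(0.0, sum_s / t_days - k))             # (6.77)
        geom.append(disc * max(0.0, math.exp(sum_log / t_days) - k))  # geometric mean
    return arith, geom


def asian_control_variate(s0=100, k=102, sigma=0.3, t_days=50, r=0.05,
                          n=20_000, lam=-1.0):
    """Return (theta, mu_mc, mu_cv, se_mc, se_cv) for a fixed lambda."""
    theta = geometric_asian_price(s0, k, sigma, t_days, r, n_avg=t_days)
    arith, geom = simulate_asian_payoffs(s0, k, sigma, t_days, r, n)
    mu_mc, theta_mc = fmean(arith), fmean(geom)
    mu_cv = mu_mc + lam * (theta_mc - theta)
    se_mc = stdev(arith) / math.sqrt(n)
    # Per-path terms a_i + lam * g_i average to mu_cv up to a constant shift.
    se_cv = stdev([a + lam * g for a, g in zip(arith, geom)]) / math.sqrt(n)
    return theta, mu_mc, mu_cv, se_mc, se_cv
```

With the default arguments, `asian_control_variate()` uses the parameter values of the example and the initial guess $\lambda = -1$; the returned standard errors allow a direct comparison of $\hat\mu_{MC}$ and $\hat\mu_{CV}$ from a single run.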


6.4.4 Rao-Blackwellization 


We have been considering the estimation of $\mu = E\{h(X)\}$ using a random sample
$X_1, \ldots, X_n$ drawn from $f$. Suppose that each $X_i = (X_{i1}, X_{i2})$ and that the conditional
expectation $E\{h(X_i) \mid x_{i2}\}$ can be solved for analytically. To motivate an alternate
estimator to $\hat\mu_{MC}$, we may use the fact that $E\{h(X_i)\} = E\{E\{h(X_i) \mid X_{i2}\}\}$, where the outer
expectation is taken with respect to the distribution of $X_{i2}$. The Rao-Blackwellized
estimator can be defined as

$$\hat\mu_{RB} = \frac{1}{n}\sum_{i=1}^{n} E\{h(X_i) \mid X_{i2}\} \tag{6.80}$$

and has the same mean as the ordinary Monte Carlo estimator $\hat\mu_{MC}$. Notice that

$$\operatorname{var}\{\hat\mu_{MC}\} = \frac{1}{n}\operatorname{var}\{E\{h(X_i) \mid X_{i2}\}\} + \frac{1}{n}E\{\operatorname{var}\{h(X_i) \mid X_{i2}\}\} \geq \operatorname{var}\{\hat\mu_{RB}\} \tag{6.81}$$

follows from the conditional variance formula. Thus, $\hat\mu_{RB}$ is superior to $\hat\mu_{MC}$ in terms
of mean squared error. This conditioning process is often called Rao-Blackwellization
due to its use of the Rao-Blackwell theorem, which states that one can reduce the
variance of an unbiased estimator by conditioning it on the sufficient statistics [96].
Further study of Rao-Blackwellization for Monte Carlo methods is given in [99, 216,
507, 542, 543].
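A small numerical illustration of (6.80) and (6.81) is sketched below. The model is an assumption made for this sketch only: $X_{i2} \sim N(0,1)$, $X_{i1} \mid X_{i2} \sim N(X_{i2}, 1)$, and $h(X_i) = X_{i1}^2$, so that $E\{h(X_i) \mid X_{i2}\} = X_{i2}^2 + 1$ and $\mu = 2$.

```python
import random
from statistics import fmean, variance


def rao_blackwell_demo(n=5_000, seed=5):
    """Compare the ordinary and Rao-Blackwellized estimators on a toy model.

    Returns (mu_mc, mu_rb, var_mc, var_rb), where the variances are the
    estimated variances of the two sample means.
    """
    rng = random.Random(seed)
    h_vals, cond_vals = [], []
    for _ in range(n):
        x2 = rng.gauss(0.0, 1.0)
        x1 = rng.gauss(x2, 1.0)
        h_vals.append(x1 ** 2)            # h(X_i), a term of the ordinary estimator
        cond_vals.append(x2 ** 2 + 1.0)   # E{h(X_i) | X_i2}, a term of (6.80)
    return (fmean(h_vals), fmean(cond_vals),
            variance(h_vals) / n, variance(cond_vals) / n)
```

In this toy model $\operatorname{var}\{h(X_i)\} = 8$ while $\operatorname{var}\{E\{h(X_i) \mid X_{i2}\}\} = 2$, so the Rao-Blackwellized estimator should show roughly one quarter of the variance.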


Example 6.14 (Rao-Blackwellization of Rejection Sampling) A generic approach
that Rao-Blackwellizes rejection sampling is described by Casella and Robert
[99]. In ordinary rejection sampling, candidates $Y_1, \ldots, Y_M$ are generated sequentially,
and some are rejected. The uniform random variables $U_1, \ldots, U_M$ provide
the rejection decisions, with $Y_i$ being rejected if $U_i > w^*(Y_i)$, where $w^*(Y_i) =
f(Y_i)/e(Y_i)$. Rejection sampling stops at a random time $M$ with the acceptance of the
$n$th draw, yielding $X_1, \ldots, X_n$. The ordinary Monte Carlo estimator of $\mu = E\{h(X)\}$
can then be reexpressed as

$$\hat\mu_{MC} = \frac{1}{n}\sum_{i=1}^{M} h(Y_i)\, 1_{\{U_i \le w^*(Y_i)\}}, \tag{6.82}$$

which presents the intriguing possibility that $\hat\mu_{MC}$ somehow can be improved by using
all the candidate $Y_i$ draws (suitably weighted), rather than merely the accepted draws.
Rao-Blackwellization of (6.82) yields the estimator

$$\hat\mu_{RB} = \frac{1}{n}\sum_{i=1}^{M} h(Y_i)\, t_i(\mathbf{Y}), \tag{6.83}$$

where the $t_i(\mathbf{Y})$ are random quantities that depend on $\mathbf{Y} = (Y_1, \ldots, Y_M)$ and $M$
according to

$$t_i(\mathbf{Y}) = E\left\{1_{\{U_i \le w^*(Y_i)\}} \mid M, Y_1, \ldots, Y_M\right\} = P\left[U_i \le w^*(Y_i) \mid M, Y_1, \ldots, Y_M\right]. \tag{6.84}$$



Now $t_M(\mathbf{Y}) = 1$ since the final candidate was accepted. For previous candidates, the
probability in (6.84) can be found by averaging over permutations of subsets of the
realized sample [99]. We obtain

$$t_i(\mathbf{Y}) = \frac{w^*(Y_i) \sum_{A \in \mathcal{A}_i} \prod_{j \in A} w^*(Y_j) \prod_{j \notin A} [1 - w^*(Y_j)]}{\sum_{B \in \mathcal{B}} \prod_{j \in B} w^*(Y_j) \prod_{j \notin B} [1 - w^*(Y_j)]}, \tag{6.85}$$

where $\mathcal{A}_i$ is the set of all subsets of $\{1, \ldots, i - 1, i + 1, \ldots, M - 1\}$ containing $n - 2$
elements, and $\mathcal{B}$ is the set of all subsets of $\{1, \ldots, M - 1\}$ containing $n - 1$ elements.
Casella and Robert [99] offer a recursion formula for computing the $t_i(\mathbf{Y})$, but it is
difficult to implement unless $n$ is fairly small.

Notice that the conditioning variables used here are statistically sufficient since
the conditional distribution of $U_1, \ldots, U_M$ does not depend on $f$. Both $\hat\mu_{RB}$ and $\hat\mu_{MC}$
are unbiased; thus, the Rao-Blackwell theorem implies that $\hat\mu_{RB}$ will have smaller
variance than $\hat\mu_{MC}$.
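For very small $M$, the weights in (6.85) can be evaluated by direct enumeration of subsets, which is useful for checking an implementation of the recursion. The sketch below is such a brute-force evaluation, not the Casella and Robert recursion. The function names are hypothetical, indices are 0-based, and the products over $j \notin A$ and $j \notin B$ are taken over the remaining indices of the respective index sets, which is the reading assumed here.

```python
from itertools import combinations
from math import prod


def subset_weight(indices, chosen, w):
    """Product of w[j] for j in chosen times (1 - w[j]) for the other indices."""
    chosen = set(chosen)
    return prod(w[j] if j in chosen else 1.0 - w[j] for j in indices)


def rb_rejection_weights(w, n):
    """Brute-force evaluation of t_i(Y) in (6.85).

    w[i] holds w*(Y_i) for the M candidates, each in (0, 1]; the last
    candidate is the n-th acceptance. The cost grows combinatorially in M.
    """
    m = len(w)
    if n == 1:                       # every earlier candidate was rejected
        return [0.0] * (m - 1) + [1.0]
    earlier = range(m - 1)           # indices 1, ..., M - 1 in the text
    denom = sum(subset_weight(earlier, b, w)
                for b in combinations(earlier, n - 1))
    t = []
    for i in earlier:
        others = [j for j in earlier if j != i]
        numer = w[i] * sum(subset_weight(others, a, w)
                           for a in combinations(others, n - 2))
        t.append(numer / denom)
    return t + [1.0]                 # t_M = 1 for the final, accepted candidate
```

A useful sanity check is that the weights sum to $n$: each subset $B$ of size $n - 1$ is counted once for each of its elements in the numerators, and the final candidate contributes 1.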


PROBLEMS 


6.1. Consider the integral sought in Example 5.1, Equation (5.7), for the parameter values 
given there. Find a simple rejection sampling envelope that will produce extremely few 
rejections when used to generate draws from the density proportional to that integrand. 


6.2. Consider the piecewise exponential envelopes for adaptive rejection sampling of the
standard normal density, which is log-concave. For the tangent-based envelope, suppose
you are limited to an even number of nodes at $\pm c_1, \ldots, \pm c_n$. For the envelope that does
not require tangent information, suppose you are limited to an odd number of nodes at
$0, \pm d_1, \ldots, \pm d_n$. The problems below will require optimization using strategies like
those in Chapter 2.


a. For n = 1,2,3,4,5, find the optimal placement of nodes for the tangent-based 
envelope. 


b. For n = 1,2,3,4,5, find the optimal placement of nodes for the tangent-free 
envelope. 


c. Plot these collections of envelopes; also plot rejection sampling waste against num- 
ber of nodes for both envelopes. Comment on your results. 


6.3. Consider finding $\sigma^2 = E\{X^2\}$ when $X$ has the density that is proportional to $q(x) =
\exp\{-|x|^3/3\}$.

a. Estimate $\sigma^2$ using importance sampling with standardized weights.

b. Repeat the estimation using rejection sampling.

c. Philippe and Robert describe an alternative to importance-weighted averaging that
employs a Riemann sum strategy with random nodes [506, 507]. When draws
$X_1, \ldots, X_n$ originate from $f$, an estimator of $E\{h(X)\}$ is

$$\sum_{i=1}^{n-1} \left(X_{[i+1]} - X_{[i]}\right) h(X_{[i]})\, f(X_{[i]}), \tag{6.86}$$
FIGURE 6.12 Number of coal-mining disasters per year between 1851 and 1962.


where $X_{[1]} \le \cdots \le X_{[n]}$ is the ordered sample associated with $X_1, \ldots, X_n$. This
estimator has faster convergence than the simple Monte Carlo estimator. When
$f = cq$ and the normalization constant $c$ is not known, then

$$\frac{\sum_{i=1}^{n-1} \left(X_{[i+1]} - X_{[i]}\right) h(X_{[i]})\, q(X_{[i]})}{\sum_{i=1}^{n-1} \left(X_{[i+1]} - X_{[i]}\right) q(X_{[i]})} \tag{6.87}$$

estimates $E\{h(X)\}$, noting that the denominator estimates $1/c$. Use this strategy to
estimate $\sigma^2$, applying it post hoc to the output obtained in part (b).


d. Carry out a replicated simulation experiment to compare the performance of the 
two estimators in parts (b) and (c). Discuss your results. 


6.4. Figure 6.12 shows some data on the number of coal-mining disasters per year between
1851 and 1962, available from the website for this book. These data originally appeared
in [434] and were corrected in [349]. The form of the data we consider is given in [91].
Other analyses of these data include [445, 525].

The rate of accidents per year appears to decrease around 1900, so we consider
a change-point model for these data. Let $j = 1$ in 1851, and index each year thereafter,
so $j = 112$ in 1962. Let $X_j$ be the number of accidents in year $j$, with $X_1, \ldots, X_\theta \sim$
i.i.d. Poisson($\lambda_1$) and $X_{\theta+1}, \ldots, X_{112} \sim$ i.i.d. Poisson($\lambda_2$). Thus the change-point
occurs after the $\theta$th year in the series, where $\theta \in \{1, \ldots, 111\}$. This model has parameters
$\theta$, $\lambda_1$, and $\lambda_2$. Below are three sets of priors for a Bayesian analysis of this model. In
each case, consider sampling from the priors as the first step of applying the SIR
algorithm for simulating from the posterior for the model parameters. Of primary interest
is inference about $\theta$.


a. Assume a discrete uniform prior for $\theta$ on $\{1, 2, \ldots, 111\}$, and priors $\lambda_i \mid a_i \sim$
Gamma$(3, a_i)$ and $a_i \sim$ Gamma$(10, 10)$ independently for $i = 1, 2$. Using the SIR
approach, estimate the posterior mean for $\theta$, and provide a histogram and a credible
interval for $\theta$. Provide similar information for estimating $\lambda_1$ and $\lambda_2$. Make a
scatterplot of $\lambda_1$ against $\lambda_2$ for the initial SIR sample, highlighting the points resampled at
the second stage of SIR. Also report your initial and resampling sample sizes, the
number of unique points and highest observed frequency in your resample, and a
measure of the effective sample size for importance sampling in this case. Discuss
your results.

b. Assume that $\lambda_2 = \alpha\lambda_1$. Use the same discrete uniform prior for $\theta$ and $\lambda_1 \mid a \sim$
Gamma$(3, a)$, $a \sim$ Gamma$(10, 10)$, and $\log \alpha \sim$ Unif$(\log 1/8, \log 2)$. Provide the
same results listed in part (a), and discuss your results.

c. Markov chain Monte Carlo approaches (see Chapter 7) are often applied in the
analysis of these data. A set of priors that resembles the improper diffuse priors used
in some such analyses is: $\theta$ having the discrete uniform prior, $\lambda_i \mid a_i \sim$ Gamma$(3, a_i)$,
and $a_i \sim$ Unif$(0, 100)$ independently for $i = 1, 2$. Provide the same result listed in
part (a), and discuss your results, including reasons why this analysis is more difficult
than the previous two.


6.5. Prove the following results.

a. If $h_1$ and $h_2$ are functions of $m$ random variables $U_1, \ldots, U_m$, and if each function
is monotone in each argument, then

$$\operatorname{cov}\{h_1(U_1, \ldots, U_m),\, h_2(1 - U_1, \ldots, 1 - U_m)\} \le 0.$$

b. Let $\hat\mu_1(\mathbf{X})$ estimate a quantity of interest, $\mu$, and let $\hat\mu_2(\mathbf{Y})$ be constructed from
realizations $Y_1, \ldots, Y_n$ chosen to be antithetic to $X_1, \ldots, X_n$. Assume that both
estimators are unbiased for $\mu$ and are negatively correlated. Find a control variate
for $\hat\mu_1$, say $Z$, with mean zero, for which the control variate estimator $\hat\mu_{CV} =
\hat\mu_1(\mathbf{X}) + \lambda Z$ corresponds to the antithetic estimator based on $\hat\mu_1$ and $\hat\mu_2$ when the
optimal $\lambda$ is used. Include your derivation of the optimal $\lambda$.


6.6. Consider testing the hypotheses $H_0\colon \lambda = 2$ versus $H_1\colon \lambda > 2$ using 25 observations
from a Poisson($\lambda$) model. Rote application of the central limit theorem would suggest
rejecting $H_0$ at $\alpha = 0.05$ when $Z > 1.645$, where $Z = (\bar X - 2)/\sqrt{2/25}$.

a. Estimate the size of this test (i.e., the type I error rate) using five Monte Carlo
approaches: standard, antithetic, importance sampling with unstandardized and
standardized weights, and importance sampling with a control variate as in
Example 6.12. Provide a confidence interval for each estimate. Discuss the
relative merits of each variance reduction technique, and compare the importance
sampling strategies with each other. For the importance sampling approaches,
use a Poisson envelope with mean equal to the $H_0$ rejection threshold, namely
$\lambda = 2.4653$.

b. Draw the power curve for this test for $\lambda \in [2.2, 4]$, using the same five techniques.
Provide pointwise confidence bands in each case. Discuss the relative merits of each
technique in this setting. Compare the performances of the importance sampling
strategies with their performance in part (a).
6.7. Consider pricing a European call option on an underlying stock with current price
$S^{(0)} = 50$, strike price $K = 52$, and volatility $\sigma = 0.5$. Suppose that there are $N = 30$
days to maturity and that the risk-free rate of return is $r = 0.05$.

a. Confirm that the fair price for this option is 2.10 when the payoff is based on $S^{(30)}$
[i.e., a standard option with payoff as in (6.74)].

b. Consider the analogous Asian option (same $S^{(0)}$, $K$, $\sigma$, $N$, and $r$) with payoff based
on the arithmetic mean stock price during the holding period, as in (6.77). Using
simple Monte Carlo, estimate the fair price for this option.

c. Improve upon the estimate in (b) using the control variate strategy described in
Example 6.13.

d. Try an antithetic approach to estimate the fair price for the option described in
part (b).

e. Using simulation and/or analysis, compare the sampling distributions of the
estimators in (b), (c), and (d).


6.8. Consider the model given by $X \sim$ Lognormal$(0, 1)$ and $\log Y = 9 + 3 \log X + \epsilon$, where
$\epsilon \sim N(0, 1)$. We wish to estimate $E\{Y/X\}$. Compare the performance of the standard
Monte Carlo estimator and the Rao-Blackwellized estimator.


6.9. Consider a bug starting at the origin of an infinite lattice $L$ and walking one unit north,
south, east, or west at each discrete time $t$. The bug cannot stay still at any step. Let
$x_{1:t}$ denote the sequence of coordinates (i.e., the path) of the bug up to time $t$, say
$\{x_i = (v_i, w_i) : i = 1, \ldots, t\}$ with $x_0 = (0, 0)$. Let the probability distribution for the
bug's path through time $t$ be denoted $f_t(x_{1:t}) = f_1(x_1) f_2(x_2 \mid x_1) \cdots f_t(x_t \mid x_{1:t-1})$.

Define $D_t(x_{1:t})$ to be the Manhattan distance of the bug from the origin at time
$t$, namely $D_t(x_{1:t}) = |v_t| + |w_t|$. Let $R_t(v, w)$ denote the number of times the bug has
visited the lattice point $(v, w)$ up to and including time $t$. Thus $R_t(x_t)$ counts the number
of visits to the current location.

The bug's path is random, but the probabilities associated with moving to the
adjacent locations at time $t$ are not equal. The bug prefers to stay close to home (i.e., near
the origin), but has an aversion to revisiting its previous locations. These preferences
are expressed by the path distribution $f_t(x_{1:t}) \propto \exp\{-(D_t(x_t) + R_t(x_t)/2)\}$.


a. Suppose we are interested in the marginal distributions of $D_t(x_t)$ and $M_t(x_{1:t}) =
\max_{(v,w) \in L}\{R_t(v, w)\}$ where the latter quantity is the greatest frequency with which
any lattice point has been visited. Use sequential importance sampling to simulate
from the marginal distributions of $D_t$ and $M_t$ at time $t = 30$. Let the proposal
distribution or envelope $g_t(x_t \mid x_{1:t-1})$ be uniform over the four lattice points surrounding
$x_{t-1}$. Estimate the mean and standard deviation of $D_{30}(x_{30})$ and $M_{30}(x_{1:30})$.

b. Let $g_t(x_t \mid x_{1:t-1})$ be proportional to $f_t(x_{1:t})$ if $x_t$ is adjacent to $x_{t-1}$ and zero otherwise.
Repeat part (a) using this choice for $g_t$ and discuss any problems encountered. In
particular, consider the situation when the bug occupies an attractive location but
arrived there via an implausible path.

c. A self-avoiding walk (SAW) is similar to the bug's behavior above except that
the bug will never revisit a site it has previously occupied. Simulation of SAWs
has been important, for example, in the study of long-chain polymers [303, 394,
553]. Let $f_t(x_{1:t})$ be the uniform distribution on all SAWs of length $t$. Show that by
using $g_t(x_t \mid x_{1:t-1}) = f_t(x_t \mid x_{1:t-1})$, the sequential update specified by $w_t = w_{t-1}u_t$ is
given by $u_t = c_{t-1}$ where $c_{t-1}$ is the number of unvisited neighboring lattice points
adjacent to $x_{t-1}$ at time $t - 1$. Estimate the mean and standard deviation of $D_{30}(x_{30})$
and $M_{30}(x_{1:30})$. Discuss the possibility that the bug becomes entrapped.

d. Finally, try applying the simplistic method of generating SAWs by simulating
paths disregarding the self-avoidance requirement and then eliminating any
self-intersecting paths post hoc. Compare the efficiency of this method to the approach
in part (c) and how it depends on the total number of steps taken (i.e., $t \gg 30$).


CHAPTER 7 


MARKOV CHAIN MONTE CARLO 


When a target density f can be evaluated but not easily sampled, the methods from 
Chapter 6 can be applied to obtain an approximate or exact sample. The primary use 
of such a sample is to estimate the expectation of a function of X ~ f(x). The Markov 
chain Monte Carlo (MCMC) methods introduced in this chapter can also be used to 
generate a draw from a distribution that approximates f, but they are more properly 
viewed as methods for generating a sample from which expectations of functions of 
X can reliably be estimated. MCMC methods are distinguished from the simulation 
techniques in Chapter 6 by their iterative nature and the ease with which they can be 
customized to very diverse and difficult problems. Viewed as an integration method, 
MCMC has several advantages over the approaches in Chapter 5: Increasing problem 
dimensionality usually does not slow convergence or make implementation more 
complex. 

A quick review of discrete-state-space Markov chain theory is provided in
Section 1.7. Let the sequence $\{X^{(t)}\}$ denote a Markov chain for $t = 0, 1, 2, \ldots$, where
$X^{(t)} = (X_1^{(t)}, \ldots, X_p^{(t)})$ and the state space is either continuous or discrete. For the
types of Markov chains introduced in this chapter, the distribution of $X^{(t)}$ converges
to the limiting stationary distribution of the chain when the chain is irreducible and
aperiodic. The MCMC sampling strategy is to construct an irreducible, aperiodic
Markov chain for which the stationary distribution equals the target distribution $f$. For
sufficiently large $t$, a realization $X^{(t)}$ from this chain will have approximate marginal
distribution $f$. A very popular application of MCMC methods is to facilitate Bayesian
inference where $f$ is a Bayesian posterior distribution for parameters $X$; a short review
of Bayesian inference is given in Section 1.5.

The art of MCMC lies in the construction of a suitable chain. A wide variety of 
algorithms has been proposed. The dilemma lies in how to determine the degree of 
distributional approximation that is inherent in realizations from the chain as well as 
estimators derived from these realizations. This question arises because the
distribution of $X^{(t)}$ may differ substantially from $f$ when $t$ is too small (note that $t$ is always
limited in computer simulations), and because the $X^{(t)}$ are serially dependent.

MCMC theory and applications are areas of active research interest. Our empha- 
sis here is on introducing some basic MCMC algorithms that are easy to implement 
and broadly applicable. In Chapter 8, we address several more sophisticated MCMC 
techniques. Some comprehensive expositions of MCMC and helpful tutorials include 
[70, 97, 106, 111, 543, 633].


Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. 
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc. 
7.1 METROPOLIS-HASTINGS ALGORITHM


A very general method for constructing a Markov chain is the Metropolis-Hastings
algorithm [324, 460]. The method begins at $t = 0$ with the selection of $X^{(0)} = x^{(0)}$
drawn at random from some starting distribution $g$, with the requirement that
$f(x^{(0)}) > 0$. Given $X^{(t)} = x^{(t)}$, the algorithm generates $X^{(t+1)}$ as follows:


1. Sample a candidate value $X^*$ from a proposal distribution $g(\cdot \mid x^{(t)})$.


2. Compute the Metropolis-Hastings ratio $R(x^{(t)}, X^*)$, where

$$R(u, v) = \frac{f(v)\, g(u \mid v)}{f(u)\, g(v \mid u)}. \tag{7.1}$$

Note that $R(x^{(t)}, X^*)$ is always defined, because the proposal $X^* = x^*$ can
only occur if $f(x^{(t)}) > 0$ and $g(x^* \mid x^{(t)}) > 0$.


3. Sample a value for $X^{(t+1)}$ according to the following:

$$X^{(t+1)} = \begin{cases} X^* & \text{with probability } \min\{R(x^{(t)}, X^*),\, 1\}, \\ x^{(t)} & \text{otherwise.} \end{cases} \tag{7.2}$$


4. Increment $t$ and return to step 1.


We will call the $t$th iteration the process that generates $X^{(t)} = x^{(t)}$. When the proposal
distribution is symmetric so that $g(x^{(t)} \mid x^*) = g(x^* \mid x^{(t)})$, the method is known as the
Metropolis algorithm [460].
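Steps 1 through 4 translate almost line for line into code. The following is a minimal sketch, not a production sampler: the function name and its argument conventions (`log_f`, `propose`, `log_g`) are hypothetical, and the ratio (7.1) is computed on the log scale purely for numerical convenience.

```python
import math
import random


def metropolis_hastings(log_f, propose, log_g, x0, n_iter, seed=6):
    """Generic Metropolis-Hastings chain.

    log_f(x)       : log target density, up to an additive constant
    propose(rng, x): draws a candidate X* from g(. | x)
    log_g(a, b)    : log g(a | b), up to an additive constant
    """
    rng = random.Random(seed)
    x, chain = x0, [x0]
    for _ in range(n_iter):
        x_star = propose(rng, x)                                  # step 1
        log_r = (log_f(x_star) + log_g(x, x_star)
                 - log_f(x) - log_g(x_star, x))                   # step 2: log of (7.1)
        # Step 3: accept with probability min{R, 1}; 1 - random() lies in (0, 1].
        if math.log(1.0 - rng.random()) <= min(log_r, 0.0):
            x = x_star
        chain.append(x)            # on rejection the current value is repeated
    return chain                   # step 4 is the loop itself
```

As a usage example under assumed inputs, a standard normal target with a symmetric normal random-walk proposal can be run with `log_f = lambda x: -x * x / 2`, `propose = lambda rng, x: x + rng.gauss(0.0, 1.0)` and `log_g = lambda a, b: -(a - b) ** 2 / 2`; because this proposal is symmetric, the `log_g` terms cancel and the sampler reduces to the Metropolis algorithm.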

Clearly, a chain constructed via the Metropolis-Hastings algorithm is Markov
since $X^{(t+1)}$ is only dependent on $X^{(t)}$. Whether the chain is irreducible and aperiodic
depends on the choice of proposal distribution; the user must check these conditions 
diligently for any implementation. If this check confirms irreducibility and aperiod- 
icity, then the chain generated by the Metropolis—Hastings algorithm has a unique 
limiting stationary distribution. This result would seem to follow from Equation 
(1.44). However, we are now considering both continuous- and discrete-state-space 
Markov chains. Nevertheless, irreducibility and aperiodicity remain sufficient condi- 
tions for convergence of Metropolis—Hastings chains. Additional theory is provided in 
[462, 543]. 

To find the unique stationary distribution of an irreducible aperiodic
Metropolis-Hastings chain, suppose $X^{(t)} \sim f(x)$, and consider two points in the state space of the
chain, say $x_1$ and $x_2$, for which $f(x_1) > 0$ and $f(x_2) > 0$. Without loss of generality,
label these points in the manner such that $f(x_2)g(x_1 \mid x_2) \ge f(x_1)g(x_2 \mid x_1)$.

It follows that the unconditional joint density of $X^{(t)} = x_1$ and $X^{(t+1)} = x_2$ is
$f(x_1)g(x_2 \mid x_1)$, because if $X^{(t)} = x_1$ and $X^* = x_2$, then $R(x_1, x_2) \ge 1$ so $X^{(t+1)} = x_2$.
The unconditional joint density of $X^{(t)} = x_2$ and $X^{(t+1)} = x_1$ is

$$f(x_2)\, g(x_1 \mid x_2)\, \frac{f(x_1)\, g(x_2 \mid x_1)}{f(x_2)\, g(x_1 \mid x_2)} \tag{7.3}$$

because we need to start with $X^{(t)} = x_2$, to propose $X^* = x_1$, and then to set $X^{(t+1)}$
equal to $X^*$ with probability $R(x_2, x_1)$. Note that (7.3) reduces to $f(x_1)g(x_2 \mid x_1)$,
which matches the joint density of $X^{(t)} = x_1$ and $X^{(t+1)} = x_2$. Therefore the joint
distribution of $X^{(t)}$ and $X^{(t+1)}$ is symmetric. Hence $X^{(t)}$ and $X^{(t+1)}$ have the same
marginal distributions. Thus the marginal distribution of $X^{(t+1)}$ is $f$, and $f$ must be
the stationary distribution of the chain.
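On a finite state space the argument can be verified numerically by building the Metropolis-Hastings transition matrix and checking that one step leaves $f$ unchanged. The sketch below does this with plain Python lists; the function names are hypothetical and any target and proposal values supplied to them are illustrative assumptions.

```python
def mh_transition_matrix(f, g):
    """Metropolis-Hastings transition matrix on a finite state space.

    f[i]    : target probability of state i (assumed positive)
    g[i][j] : proposal probability of moving from i to j (rows sum to 1)
    """
    k = len(f)
    p = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i != j and g[i][j] > 0.0:
                ratio = (f[j] * g[j][i]) / (f[i] * g[i][j])   # R(i, j) from (7.1)
                p[i][j] = g[i][j] * min(ratio, 1.0)
        p[i][i] = 1.0 - sum(p[i])     # rejected or self proposals stay at i
    return p


def one_step(f, p):
    """Marginal distribution of X^(t+1) when X^(t) has distribution f."""
    k = len(f)
    return [sum(f[i] * p[i][j] for i in range(k)) for j in range(k)]
```

For any proposal matrix with strictly positive entries, `one_step(f, mh_transition_matrix(f, g))` should reproduce `f` up to rounding, and `f[i] * p[i][j]` should equal `f[j] * p[j][i]`, which is the symmetry of the joint distribution used in the argument above.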

Recall from Equation (1.46) that we can approximate the expectation of a func- 
tion of a random variable by averaging realizations from the stationary distribution of 
a Metropolis-Hastings chain. The distribution of realizations from the
Metropolis-Hastings chain approximates the stationary distribution of the chain as $t$ progresses;
therefore $E\{h(X)\} \approx (1/n)\sum_{t=1}^{n} h(x^{(t)})$. Some of the useful quantities that can be
estimated this way include means $E\{h(X)\}$, variances $E\{[h(X) - E\{h(X)\}]^2\}$, and
tail probabilities $E\{1_{\{h(X) \le q\}}\}$ for constant $q$, where $1_{\{A\}} = 1$ if $A$ is true and 0
otherwise. Using the density estimation methods of Chapter 10, estimates of $f$ itself
can also be obtained. Due to the limiting properties of the Markov chain, estimates
of all these quantities based on sample averages are strongly consistent. Note that the
sequence $x^{(0)}, x^{(1)}, \ldots$ will likely include multiple copies of some points in the state
space. This occurs when $X^{(t+1)}$ retains the previous value $x^{(t)}$ rather than jumping to
the proposed value $x^*$. It is important to include these copies in the chain and in any
sample averages since the frequencies of sampled points are used to correct for the 
fact that the proposal density differs from the target density. 

In some applications persistent dependence of the chain on its starting value 
can seriously degrade its performance. Therefore it may be sensible to omit some of 
the initial realizations of the chain when computing a sample average. This is called 
the burn-in period and is an essential component of MCMC applications. As with 
optimization algorithms, it is also a good idea to run MCMC procedures like the 
Metropolis—Hastings algorithm from multiple starting points to check for consistent 
results. See Section 7.3 for implementation advice about burn-in, number of chains, 
starting values, and other aspects of MCMC implementation. 
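The sample-average estimates described above, computed after discarding a burn-in period, can be sketched as a small helper. The function name and its arguments are hypothetical; the chain is assumed to be a list of realizations with repeated values retained.

```python
def summarize_chain(chain, h=lambda x: x, q=0.0, burn_in=0):
    """Sample-average estimates of E{h(X)}, var{h(X)} and P[h(X) <= q]."""
    kept = [h(x) for x in chain[burn_in:]]      # drop the burn-in period
    n = len(kept)
    mean = sum(kept) / n
    var = sum((v - mean) ** 2 for v in kept) / n
    tail = sum(1 for v in kept if v <= q) / n   # average of the indicator
    return mean, var, tail
```

Applying the same helper to chains started from several different points gives a simple way to check whether the results are consistent across starting values.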

Specific features of good proposal distributions can greatly enhance the per- 
formance of the Metropolis—Hastings algorithm. A well-chosen proposal distribution 
produces candidate values that cover the support of the stationary distribution in a 
reasonable number of iterations and, similarly, produces candidate values that are not 
accepted or rejected too frequently [111]. Both of these factors are related to the spread 
of the proposal distribution. If the proposal distribution is too diffuse relative to the 
target distribution, the candidate values will be rejected frequently and thus the chain 
will require many iterations to adequately explore the space of the target distribution. 
If the proposal distribution is too focused (e.g., has too small a variance), then the 
chain will remain in one small region of the target distribution for many iterations 
while other regions of the target distribution will not be adequately explored. Thus 
a proposal distribution whose spread is either too small or too large can produce a 
chain that requires many iterations to adequately sample the regions supported by the 
target distribution. Section 7.3.1 further discusses this and related issues. 
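The effect of proposal spread is easy to see in a small experiment. The sketch below runs a random-walk Metropolis chain on a standard normal target (an illustrative choice made for this sketch) and reports the acceptance rate and the range of values visited for a given proposal standard deviation; the function name is hypothetical.

```python
import math
import random


def rw_metropolis_acceptance(proposal_sd, n_iter=20_000, seed=7):
    """Acceptance rate and visited range of a random-walk chain targeting N(0, 1)."""
    rng = random.Random(seed)
    x, accepted, lo, hi = 0.0, 0, 0.0, 0.0
    for _ in range(n_iter):
        x_star = x + rng.gauss(0.0, proposal_sd)
        # Symmetric proposal, so R = f(x*) / f(x) for the N(0, 1) density.
        log_r = (x * x - x_star * x_star) / 2.0
        if math.log(1.0 - rng.random()) <= min(log_r, 0.0):
            x, accepted = x_star, accepted + 1
        lo, hi = min(lo, x), max(hi, x)
    return accepted / n_iter, (lo, hi)
```

A very small `proposal_sd` (say 0.05) accepts nearly every candidate but moves in tiny steps, while a very large one (say 50) rejects nearly every candidate; both extremes explore the target slowly, as described above.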

Below we introduce several Metropolis-Hastings variants obtained by using
different classes of proposal distributions.


7.1.1 Independence Chains


Suppose that the proposal distribution for the Metropolis-Hastings algorithm is
chosen such that $g(x^* \mid x^{(t)}) = g(x^*)$ for some fixed density $g$. This yields an
independence chain, where each candidate value is drawn independently of the past.
In this case, the Metropolis-Hastings ratio is

$$R(x^{(t)}, X^*) = \frac{f(X^*)\, g(x^{(t)})}{f(x^{(t)})\, g(X^*)}. \tag{7.4}$$

The resulting Markov chain is irreducible and aperiodic if $g(x) > 0$ whenever
$f(x) > 0$.

Notice that the Metropolis-Hastings ratio in (7.4) can be reexpressed as the ratio
of importance ratios (see Section 6.4.1) where $f$ is the target and $g$ is the envelope: If
$w^* = f(X^*)/g(X^*)$ and $w^{(t)} = f(x^{(t)})/g(x^{(t)})$, then
$R(x^{(t)}, X^*) = w^*/w^{(t)}$. This reexpression indicates that when $w^{(t)}$ is much
larger than typical $w^*$ values, the chain will tend to get stuck for long periods at the
current value. Therefore, the criteria discussed in Section 6.3.1 for choosing importance
sampling envelopes are also relevant here for choosing proposal distributions: The
proposal distribution $g$ should resemble the target distribution $f$, but should cover $f$
in the tails.


Example 7.1 (Bayesian Inference) MCMC methods like the Metropolis-Hastings
algorithm are particularly popular tools for Bayesian inference, where some data
$y$ are observed with likelihood function $L(\theta \mid y)$ for parameters $\theta$ which have prior
distribution $p(\theta)$. Bayesian inference is based on the posterior distribution
$p(\theta \mid y) = c\, p(\theta) L(\theta \mid y)$, where $c$ is an unknown constant. The difficulty of
computing $c$ and other features of the posterior prevents most direct inferential
strategies. However, if we can obtain a sample from a Markov chain whose stationary
distribution is the target posterior, this sample may be used to estimate posterior
moments, tail probabilities, and many other useful quantities, including the posterior
density itself. MCMC methods typically allow easy generation of such a sample in the
Bayesian context.

A very simple strategy is to use the prior as a proposal distribution in an independence
chain. In our Metropolis-Hastings notation, $f(\theta) = p(\theta \mid y)$ and $g(\theta^*) = p(\theta^*)$.
Conveniently, this means

$$R(\theta^{(t)}, \theta^*) = \frac{L(\theta^* \mid y)}{L(\theta^{(t)} \mid y)}. \tag{7.5}$$

In other words, we propose from the prior, and the Metropolis-Hastings ratio equals
the likelihood ratio. By definition, the support of the prior covers the support of the
target posterior, so the stationary distribution of this chain is the desired posterior.
There are more specialized MCMC algorithms to sample various types of posteriors
in more efficient manners, but this is perhaps the simplest generic approach.

FIGURE 7.1 Histogram of 100 observations simulated from the mixture distribution (7.6)
in Example 7.2.
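
The simplification to a likelihood ratio in (7.5) of Example 7.1 follows in one line from the independence-chain ratio (7.4). The worked step below only substitutes the definitions $f(\theta) = c\,p(\theta)L(\theta \mid y)$ and $g(\theta) = p(\theta)$ from that example and cancels the constant and the prior terms.

```latex
\[
R(\theta^{(t)}, \theta^*)
  = \frac{f(\theta^*)\, g(\theta^{(t)})}{f(\theta^{(t)})\, g(\theta^*)}
  = \frac{c\, p(\theta^*)\, L(\theta^* \mid y)\; p(\theta^{(t)})}
         {c\, p(\theta^{(t)})\, L(\theta^{(t)} \mid y)\; p(\theta^*)}
  = \frac{L(\theta^* \mid y)}{L(\theta^{(t)} \mid y)} .
\]
```

In particular, the unknown constant $c$ cancels and never needs to be evaluated.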


Example 7.2 (Mixture Distribution) Suppose we have observed data
$y_1, y_2, \ldots, y_{100}$ sampled independently and identically distributed from the
mixture distribution

$$\delta\, N(7, 0.5^2) + (1 - \delta)\, N(10, 0.5^2). \tag{7.6}$$

Figure 7.1 shows a histogram of the data, which are available from the website for
this book. Mixture densities are common in real-life applications where, for example,
the data may come from more than one population. We will use MCMC techniques
to construct a chain whose stationary distribution equals the posterior density of $\delta$
assuming a Unif(0,1) prior distribution for $\delta$. The data were generated with $\delta = 0.7$,
so we should find that the posterior density is concentrated in this area.

In this example, we try two different independence chains. In the first case we use
a Beta(1,1) density as the proposal density, and in the second case we use a Beta(2,10)
density. The first proposal distribution is equivalent to a Unif(0,1) distribution, while
the second is skewed right with mean approximately equal to 0.167. In this second
case values of $\delta$ near 0.7 are unlikely to be generated from the proposal distribution.

Figure 7.2 shows the sample paths for 10,000 iterations of both chains. A sample
path is a plot of the chain realizations $\delta^{(t)}$ against the iteration number $t$. This plot
is useful for investigating the behavior of the Markov chain and is discussed further
in Section 7.3.1. The top panel of Figure 7.2 corresponds to the chain generated
using the Beta(1,1) proposal density. This panel shows a Markov chain that moves
quickly away from its starting value and seems easily able to sample values from
all portions of the parameter space supported by the posterior for $\delta$. Such behavior
is called good mixing. The lower panel corresponds to the chain using a Beta(2,10)
proposal density. The resulting chain moves slowly from its starting value and does
a poor job of exploring the region of posterior support (i.e., poor mixing). This chain
has clearly not converged to its stationary distribution since drift is still apparent. Of
course, the long-run behavior of the chain will in principle allow estimation of aspects
of the posterior distribution for $\delta$ since the posterior is still the limiting distribution of
the chain. Yet, chain behavior like that shown in the bottom panel of Figure 7.2 does
not inspire confidence: The chain seems nonstationary, only a few unique values of
$\delta^{(t)}$ were accepted, and the starting value does not appear to have washed out. A plot
like the lower plot in Figure 7.2 should make the MCMC user reconsider the proposal
density and other aspects of the MCMC implementation.

FIGURE 7.2 Sample paths for $\delta$ from independence chains with proposal densities Beta(1,1)
(top) and Beta(2,10) (bottom) considered in Example 7.2.

Figure 7.3 shows histograms of the realizations from the chains, after the first
200 iterations have been omitted to reduce the effect of the starting value (see the
discussion of burn-in periods in Section 7.3.1.2). The top and bottom panels again
correspond to the Beta(1,1) and Beta(2,10) proposal distributions, respectively. This
plot shows that the chain with the Beta(1,1) proposal density produced a sample for
$\delta$ whose mean well approximates the true value (and posterior mean) of $\delta = 0.7$.
On the other hand, the chain with the Beta(2,10) proposal density would not yield
reliable estimates for the posterior or the true value of $\delta$ based on the first 10,000
iterations.
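
The two independence chains of Example 7.2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the book's data set is not reproduced here, so `simulate_mixture` generates stand-in data from (7.6) with $\delta = 0.7$, and the function names, seeds, and starting value 0.5 are assumptions made for the sketch. The acceptance step uses the importance-ratio form $w^*/w^{(t)}$ of (7.4).

```python
import math
import numpy as np

def simulate_mixture(n=100, delta=0.7, seed=1):
    """Stand-in data (assumption): n draws from delta*N(7,0.5^2) + (1-delta)*N(10,0.5^2)."""
    rng = np.random.default_rng(seed)
    from_first = rng.uniform(size=n) < delta
    return np.where(from_first, rng.normal(7.0, 0.5, n), rng.normal(10.0, 0.5, n))

def log_posterior(delta, y):
    """Log posterior of delta, up to an additive constant, under a Unif(0,1) prior."""
    if not 0.0 < delta < 1.0:
        return -math.inf
    # The factor 1/(0.5*sqrt(2*pi)) is common to both components and is dropped.
    k1 = np.exp(-0.5 * ((y - 7.0) / 0.5) ** 2)
    k2 = np.exp(-0.5 * ((y - 10.0) / 0.5) ** 2)
    return float(np.sum(np.log(delta * k1 + (1.0 - delta) * k2)))

def log_beta_pdf(x, a, b):
    """Log density of a Beta(a, b) distribution at x in (0, 1)."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x))

def independence_chain(y, a, b, n_iter=10_000, start=0.5, seed=0):
    """Independence chain with a Beta(a, b) proposal; returns (chain, acceptance rate)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter + 1)
    chain[0] = start
    log_w = log_posterior(start, y) - log_beta_pdf(start, a, b)  # log w^(t)
    n_accept = 0
    for t in range(n_iter):
        cand = rng.beta(a, b)
        log_w_cand = log_posterior(cand, y) - log_beta_pdf(cand, a, b)  # log w*
        # Accept with probability min(1, w*/w^(t)).
        if rng.uniform() < math.exp(min(0.0, log_w_cand - log_w)):
            chain[t + 1], log_w = cand, log_w_cand
            n_accept += 1
        else:
            chain[t + 1] = chain[t]
    return chain, n_accept / n_iter

y = simulate_mixture()
chain_flat, rate_flat = independence_chain(y, 1.0, 1.0)    # Beta(1,1) proposal
chain_skew, rate_skew = independence_chain(y, 2.0, 10.0)   # Beta(2,10) proposal
print("Beta(1,1):  mean after burn-in", chain_flat[201:].mean(), "acceptance", rate_flat)
print("Beta(2,10): mean after burn-in", chain_skew[201:].mean(), "acceptance", rate_skew)
```

Working with log weights avoids underflow in the product of 100 likelihood terms. Plotting `chain_flat` and `chain_skew` against the iteration number gives sample paths of the kind described for Figure 7.2, although the exact paths depend on the simulated data and seeds.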


7.1.2 Random Walk Chains 


A random walk chain is another type of Markov chain produced via a simple variant
of the Metropolis-Hastings algorithm. Let $X^*$ be generated by drawing
$\epsilon \sim h(\epsilon)$ for some density $h$ and then setting $X^* = x^{(t)} + \epsilon$. This yields
a random walk chain. In this case, $g(x^* \mid x^{(t)}) = h(x^* - x^{(t)})$. Common choices
for $h$ include a uniform distribution over a ball centered at the origin, a scaled
standard normal distribution, and a scaled Student's $t$ distribution. If the support
region of $f$ is connected and $h$ is positive in a neighborhood of 0, the resulting chain
is irreducible and aperiodic [543].

FIGURE 7.3 Histograms of $\delta^{(t)}$ for iterations 201-10,000 of independence chains with
proposal densities Beta(1,1) (top) and Beta(2,10) (bottom) considered in Example 7.2.

Figure 7.4 illustrates how a random walk chain might progress in a two-dimensional
problem. The figure shows a contour plot of a two-dimensional target distribution
(dotted lines) along with the first steps of a random walk MCMC procedure. The
sample path is shown by the solid line connecting successive values in the chain
(dots). The chain starts at $x^{(0)}$. The second candidate value is accepted to yield
$x^{(2)}$. The circles around $x^{(0)}$ and $x^{(2)}$ show the proposal densities, where $h$ is
a uniform distribution over a disk centered at the origin. In a random walk chain,
the proposal density at iteration $t + 1$ is centered around $x^{(t)}$. Some candidate values
are rejected. For example, the 13th candidate value, denoted by $\circ$, is not accepted,
so $x^{(13)} = x^{(12)}$. Note how the chain frequently moves up the contours of the target
distribution, while also allowing some downhill moves. The move from $x^{(15)}$ to
$x^{(16)}$ is one instance where the chain moves downhill.


Example 7.3 (Mixture Distribution, Continued) As a continuation of Example 7.2,
consider using a random walk chain to learn about the posterior for $\delta$ under
a Unif(0,1) prior. Suppose we generate proposals by adding a Unif($-a$, $a$) random
increment to the current $\delta^{(t)}$. Clearly it is likely that some proposals will be generated
outside the interval [0, 1] during the progression of the chain. An inelegant approach
is to note that the posterior is zero for any $\delta \notin [0, 1]$, thereby forbidding steps to such
points. An approach that usually is better involves reparameterizing the problem.
Let $U = \mathrm{logit}\{\delta\} = \log\{\delta/(1 - \delta)\}$. We may now run a random walk chain on $U$,
generating a proposal by adding, say, a Unif($-b$, $b$) random increment to $u^{(t)}$.

FIGURE 7.4 Hypothetical random walk chain for sampling a two-dimensional target
distribution (dotted contours) using proposed increments sampled uniformly from a disk
centered at the current value. See text for details.


There are two ways to view the reparameterization. First, we may run the
chain in $\delta$-space. In this case, the proposal density $g(\cdot \mid u^{(t)})$ must be transformed
into a proposal density in $\delta$-space, taking account of the Jacobian. The
Metropolis-Hastings ratio for a proposed value $\delta^*$ is then

$$\frac{f(\delta^*)\; g(\mathrm{logit}\{\delta^{(t)}\} \mid \mathrm{logit}\{\delta^*\})\; |J(\delta^{(t)})|}
       {f(\delta^{(t)})\; g(\mathrm{logit}\{\delta^*\} \mid \mathrm{logit}\{\delta^{(t)}\})\; |J(\delta^*)|}, \tag{7.7}$$

where, for example, $|J(\delta^{(t)})|$ is the absolute value of the (determinant of the) Jacobian
for the transformation from $\delta$ to $u$, evaluated at $\delta^{(t)}$. The second option is to run
the chain in $u$-space. In this case, the target density for $\delta$ must be transformed into
a density for $u$, where $\delta = \mathrm{logit}^{-1}\{U\} = \exp\{U\}/(1 + \exp\{U\})$. For $U^* = u^*$, this
yields the Metropolis-Hastings ratio

$$\frac{f(\mathrm{logit}^{-1}\{u^*\})\; |J(u^*)|\; g(u^{(t)} \mid u^*)}
       {f(\mathrm{logit}^{-1}\{u^{(t)}\})\; |J(u^{(t)})|\; g(u^* \mid u^{(t)})}. \tag{7.8}$$

Since $|J(u^*)| = 1/|J(\delta^*)|$, we can see that these two viewpoints produce equivalent
chains. Examples 7.10 and 8.1 demonstrate the change-of-variables method within
the Metropolis-Hastings algorithm.

The random walk chain run with uniform increments in a reparameterized space
may have quite different properties than one generated from uniform increments in the
original space. Reparameterization is a useful approach to improving the performance
of MCMC methods and is discussed further in Section 7.3.1.4.

Figure 7.5 shows sample paths for $\delta$ from two random walk chains run in
$u$-space. The top panel corresponds to a chain generated by drawing
$\epsilon \sim \mathrm{Unif}(-1, 1)$, setting $U^* = u^{(t)} + \epsilon$, and then using (7.8) to compute the
Metropolis-Hastings ratio. The top panel in Figure 7.5 shows a Markov chain that
moves quickly away from its starting value and seems easily able to sample values
from all portions of the parameter space supported by the posterior for $\delta$. The lower
panel corresponds to the chain using $\epsilon \sim \mathrm{Unif}(-0.01, 0.01)$, which yields very poor
mixing. The resulting chain moves slowly from its starting value and takes very small
steps in $\delta$-space at each iteration.

FIGURE 7.5 Sample paths for $\delta$ from random walk chains in Example 7.3, run in $u$-space
with $b = 1$ (top) and $b = 0.01$ (bottom).
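
The $u$-space version of the random walk chain in Example 7.3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the data are simulated stand-ins for the book's data set, and the function names, seeds, and starting value are hypothetical. Because the Unif($-b$, $b$) increment is symmetric, the proposal terms $g$ cancel in (7.8), leaving only the transformed target $f(\mathrm{logit}^{-1}\{u\})\,|J(u)|$ with $|J(u)| = \delta(1 - \delta)$.

```python
import math
import numpy as np

def simulate_mixture(n=100, delta=0.7, seed=1):
    """Stand-in data (assumption): n draws from delta*N(7,0.5^2) + (1-delta)*N(10,0.5^2)."""
    rng = np.random.default_rng(seed)
    from_first = rng.uniform(size=n) < delta
    return np.where(from_first, rng.normal(7.0, 0.5, n), rng.normal(10.0, 0.5, n))

def log_lik(delta, y):
    """Log likelihood of delta up to a constant (equals the log posterior under Unif(0,1))."""
    k1 = np.exp(-0.5 * ((y - 7.0) / 0.5) ** 2)
    k2 = np.exp(-0.5 * ((y - 10.0) / 0.5) ** 2)
    return float(np.sum(np.log(delta * k1 + (1.0 - delta) * k2)))

def log_target_u(u, y):
    """Log density of U = logit(delta): log f(logit^{-1}(u)) + log |J(u)|."""
    if abs(u) > 30.0:          # guard against delta rounding to exactly 0 or 1
        return -math.inf
    delta = 1.0 / (1.0 + math.exp(-u))
    # |J(u)| = d(delta)/du = delta * (1 - delta)
    return log_lik(delta, y) + math.log(delta) + math.log1p(-delta)

def random_walk_chain(y, b, n_iter=10_000, start_delta=0.5, seed=0):
    """Random walk chain in u-space with Unif(-b, b) increments; returns delta values."""
    rng = np.random.default_rng(seed)
    u = math.log(start_delta / (1.0 - start_delta))
    log_f = log_target_u(u, y)
    deltas = np.empty(n_iter + 1)
    deltas[0] = start_delta
    for t in range(n_iter):
        u_cand = u + rng.uniform(-b, b)
        log_f_cand = log_target_u(u_cand, y)
        # Symmetric increments: g(u^(t) | u*) = g(u* | u^(t)), so only the target ratio remains.
        if rng.uniform() < math.exp(min(0.0, log_f_cand - log_f)):
            u, log_f = u_cand, log_f_cand
        deltas[t + 1] = 1.0 / (1.0 + math.exp(-u))
    return deltas

y = simulate_mixture()
deltas_wide = random_walk_chain(y, b=1.0)      # b = 1
deltas_narrow = random_walk_chain(y, b=0.01)   # b = 0.01
print("b = 1:    mean after burn-in", deltas_wide[201:].mean())
print("b = 0.01: largest single step in delta", np.abs(np.diff(deltas_narrow)).max())
```

Since $d\delta/du \le 1/4$, the chain with $b = 0.01$ can never move more than 0.0025 in $\delta$-space in one iteration, which is the slow movement the lower panel of Figure 7.5 is described as showing.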


7.2 GIBBS SAMPLING 


Thus far we have treated $X^{(t)}$ with little regard to its dimensionality. The Gibbs
sampler is specifically adapted for multidimensional target distributions. The goal is
to construct a Markov chain whose stationary distribution (or some marginalization
thereof) equals the target distribution $f$. The Gibbs sampler does this by sequentially
sampling from univariate conditional distributions, which are often available in closed
form.


7.2.1 Basic Gibbs Sampler 


Recall $X = (X_1, \ldots, X_p)^T$, and denote
$X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)^T$. Suppose that the univariate
conditional density of $X_i \mid X_{-i} = x_{-i}$, denoted $f(x_i \mid x_{-i})$, is easily sampled
for $i = 1, \ldots, p$. A general Gibbs sampling procedure can be described as follows:


1. Select starting values $x^{(0)}$, and set $t = 0$.

2. Generate, in turn,

$$\begin{aligned}
X_1^{(t+1)} \mid \cdot &\sim f\big(x_1 \mid x_2^{(t)}, \ldots, x_p^{(t)}\big), \\
X_2^{(t+1)} \mid \cdot &\sim f\big(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)}\big), \\
&\;\;\vdots \\
X_{p-1}^{(t+1)} \mid \cdot &\sim f\big(x_{p-1} \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{p-2}^{(t+1)}, x_p^{(t)}\big), \\
X_p^{(t+1)} \mid \cdot &\sim f\big(x_p \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{p-1}^{(t+1)}\big),
\end{aligned} \tag{7.9}$$

where $\mid \cdot$ denotes conditioning on the most recent updates to all other elements
of $X$.

3. Increment $t$ and go to step 2.


The completion of step 2 for all components of $X$ is called a cycle. Several methods for
improving and generalizing the basic Gibbs sampler are discussed in Sections
7.2.3-7.2.6. In subsequent discussion of the Gibbs sampler, we frequently refer to the
term $x_{-i}^{(t)}$, which represents all the components of $x$, except for $x_i$, at their current
values, so

$$x_{-i}^{(t)} = \big(x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_p^{(t)}\big).$$
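
Steps 1-3 of the basic Gibbs sampler can be sketched for a toy target. The example below is not from the text: it assumes a standard bivariate normal target with correlation `rho`, chosen only because its univariate conditionals, $X_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $X_2$, are known in closed form. The function name and defaults are hypothetical.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_iter=10_000, x0=(0.0, 0.0), seed=0):
    """Basic Gibbs sampler for a standard bivariate normal with correlation rho (toy target)."""
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    x1, x2 = x0                                  # step 1: starting values x^(0), t = 0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):                      # step 3: increment t and repeat step 2
        x1 = rng.normal(rho * x2, cond_sd)       # step 2: X1^(t+1) | x2^(t)
        x2 = rng.normal(rho * x1, cond_sd)       #         X2^(t+1) | x1^(t+1)
        draws[t] = (x1, x2)                      # one completed cycle
    return draws

draws = gibbs_bivariate_normal(rho=0.8)
sample_corr = np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]
print("sample means:", draws.mean(axis=0), "sample correlation:", sample_corr)
```

Note that the update of `x2` conditions on the value of `x1` drawn in the same cycle, which is exactly the "most recent updates" convention in (7.9).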


Example 7.4 (Stream Ecology) Stream insects called benthic invertebrates are an 
effective indicator for monitoring stream ecology because their relatively stationary 
substrate habitation provides constant exposure to contamination and easy sampling 
of large numbers of individuals. Imagine that at many sites along a stream, insects are 
collected and classified into several categories based on some ecologically significant 
criterion. At a particular site, let $Y_1, \ldots, Y_c$ denote the counts of insects in each of $c$
different classes.

The probability that an insect is classified in each category varies randomly
from site to site, as does the total number collected at each site. For a given site, let
$P_1, \ldots, P_c$ denote the class probabilities, and let $N$ denote the random total number
of insects collected. Suppose, further, that the $P_1, \ldots, P_c$ depend on a set of
site-specific features summarized by parameters $\alpha_1, \ldots, \alpha_c$, respectively. Let $N$
depend on a site-specific parameter, $\lambda$.

Suppose two competing statistics, $T_1(Y_1, \ldots, Y_c)$ and $T_2(Y_1, \ldots, Y_c)$, are used
to monitor streams for negative environmental events. An alarm is triggered if the value
of $T_1$ or $T_2$ exceeds some threshold. To compare the performance of these statistics
across multiple sites within the same stream or across different types of streams,
a Monte Carlo simulation experiment is designed. The experiment is designed by
choosing a collection of parameter sets $(\lambda, \alpha_1, \ldots, \alpha_c)$ that are believed to
encompass the range of sampling effort and characteristics of sites and streams likely
to be monitored. Each parameter set corresponds to a hypothetical sampling effort at a
simulated site.

Let $c = 3$. For a given simulated site, we can establish the model:

$$\begin{aligned}
(Y_1, Y_2, Y_3) \mid (N = n, P_1 = p_1, P_2 = p_2, P_3 = p_3) &\sim \mathrm{Multinomial}(n; p_1, p_2, p_3), \\
(P_1, P_2, P_3) &\sim \mathrm{Dirichlet}(\alpha_1, \alpha_2, \alpha_3), \\
N &\sim \mathrm{Poisson}(\lambda),
\end{aligned}$$

where $N$ is viewed as random because it varies across sites. This model is overspecified
because we require $Y_1 + Y_2 + Y_3 = N$ and $P_1 + P_2 + P_3 = 1$. Therefore, we can
write the state of the model as $X = (Y_1, Y_2, P_1, P_2, N)$, where the remaining variables
can be determined analytically for any value of $X$. Casella and George offer a related
model for the hatching of insect eggs [97]. More sophisticated models of stream
ecology data are given in [351].

To complete the simulation experiment, it is necessary to sample from the
marginal distribution of $(Y_1, Y_2, Y_3)$ so that the performance of the statistics $T_1$ and
$T_2$ may be compared for a simulated site of the current type. Having repeated this
process over the designed set of simulated sites, comparative conclusions about $T_1$
and $T_2$ can be drawn.

It is impossible to get a closed-form expression for the marginal distribution
of $(Y_1, Y_2, Y_3)$ given the parameters $\lambda$, $\alpha_1$, $\alpha_2$, and $\alpha_3$. The most succinct way to
summarize the Gibbs sampling scheme for this problem is

$$\begin{aligned}
(Y_1, Y_2, Y_3) \mid \cdot &\sim \mathrm{Multinomial}(n; p_1, p_2, p_3), \\
(P_1, P_2, P_3) \mid \cdot &\sim \mathrm{Dirichlet}(y_1 + \alpha_1,\; y_2 + \alpha_2,\; n - y_1 - y_2 + \alpha_3), \\
N - y_1 - y_2 \mid \cdot &\sim \mathrm{Poisson}\big(\lambda (1 - p_1 - p_2)\big),
\end{aligned} \tag{7.10}$$

where $\mid \cdot$ denotes that the distribution is conditional on the variables remaining from
the complete set of variables $\{N, Y_1, Y_2, Y_3, P_1, P_2, P_3\}$. Problem 7.4 asks you to
derive these distributions.

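At first glance, (7.10) does not seem to resemble the univariate sampling strategy
inherent in a Gibbs sampler. It is straightforward to show that (7.10) amounts
to the following sampling scheme based on univariate conditional distributions of
components of $X$:

$$\begin{aligned}
Y_1^{(t+1)} \mid \cdot &\sim \mathrm{Bin}\left(n^{(t)} - y_2^{(t)},\; \frac{p_1^{(t)}}{1 - p_2^{(t)}}\right), \\
Y_2^{(t+1)} \mid \cdot &\sim \mathrm{Bin}\left(n^{(t)} - y_1^{(t+1)},\; \frac{p_2^{(t)}}{1 - p_1^{(t)}}\right), \\
\frac{P_1^{(t+1)}}{1 - p_2^{(t)}} \,\Big|\, \cdot &\sim \mathrm{Beta}\left(y_1^{(t+1)} + \alpha_1,\; n^{(t)} - y_1^{(t+1)} - y_2^{(t+1)} + \alpha_3\right), \\
\frac{P_2^{(t+1)}}{1 - p_1^{(t+1)}} \,\Big|\, \cdot &\sim \mathrm{Beta}\left(y_2^{(t+1)} + \alpha_2,\; n^{(t)} - y_1^{(t+1)} - y_2^{(t+1)} + \alpha_3\right),
\end{aligned}$$

and

$$N^{(t+1)} - y_1^{(t+1)} - y_2^{(t+1)} \mid \cdot \sim \mathrm{Poisson}\left(\lambda \big(1 - p_1^{(t+1)} - p_2^{(t+1)}\big)\right).$$

In Section 7.2.4 an alternative Gibbs approach uses (7.10) directly.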
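
A cycle of the univariate scheme in Example 7.4 can be sketched as below. The parameter values ($\lambda = 50$ and $\alpha = (4, 6, 10)$), the starting values, and the function name are assumptions chosen only for illustration; the text does not specify them. Each update conditions on the most recent values of the other components of $X = (Y_1, Y_2, P_1, P_2, N)$.

```python
import numpy as np

def stream_gibbs(lam, alpha, n_iter=20_000, seed=0):
    """Gibbs sampler for (Y1, Y2, P1, P2, N) using univariate conditionals; returns (Y1, Y2, Y3)."""
    rng = np.random.default_rng(seed)
    a1, a2, a3 = alpha
    # Arbitrary but valid starting values (assumption).
    n = int(round(lam))
    p1, p2 = a1 / sum(alpha), a2 / sum(alpha)
    y1, y2, _ = rng.multinomial(n, [p1, p2, 1.0 - p1 - p2])
    counts = np.empty((n_iter, 3), dtype=int)
    for t in range(n_iter):
        y1 = rng.binomial(n - y2, p1 / (1.0 - p2))         # Y1 | .
        y2 = rng.binomial(n - y1, p2 / (1.0 - p1))         # Y2 | .
        y3 = n - y1 - y2
        p1 = (1.0 - p2) * rng.beta(y1 + a1, y3 + a3)       # P1 / (1 - p2) | . is Beta
        p2 = (1.0 - p1) * rng.beta(y2 + a2, y3 + a3)       # P2 / (1 - p1) | . is Beta
        n = y1 + y2 + rng.poisson(lam * (1.0 - p1 - p2))   # N - y1 - y2 | . is Poisson
        counts[t] = (y1, y2, n - y1 - y2)
    return counts

counts = stream_gibbs(lam=50.0, alpha=(4.0, 6.0, 10.0))
print("estimated E[Y1], E[Y2], E[Y3]:", counts.mean(axis=0))
```

Under this model the marginal means are $E[Y_j] = \lambda \alpha_j / (\alpha_1 + \alpha_2 + \alpha_3)$, which gives a simple sanity check on the chain; the rows of `counts` are the realizations of $(Y_1, Y_2, Y_3)$ on which statistics such as $T_1$ and $T_2$ would be evaluated.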


Example 7.5 (Bayesian Inference, Continued) The Gibbs sampler is particularly
useful for Bayesian applications when the goal is to make inference based
on the posterior distribution of multiple parameters. Recall Example 7.1 where the
parameter vector $\theta$ has prior distribution $p(\theta)$ and likelihood function $L(\theta \mid y)$
arising from observed data $y$. Bayesian inference is based on the posterior distribution
$p(\theta \mid y) = c\, p(\theta) L(\theta \mid y)$, where $c$ is an unknown constant. When the requisite
univariate conditional densities are easily sampled, the Gibbs sampler can be applied and
does not require evaluation of the constant, which satisfies
$c^{-1} = \int p(\theta) L(\theta \mid y)\, d\theta$. In this case the $i$th step in a cycle of the Gibbs
sampler at iteration $t$ is given by draws from

$$\theta_i^{(t+1)} \mid \big(\theta_{-i}^{(t)}, y\big) \sim p\big(\theta_i \mid \theta_{-i}^{(t)}, y\big),$$

where $p$ is the univariate conditional posterior of $\theta_i$ given the remaining parameters
and the data.


Example 7.6 (Fur Seal Pup Capture—Recapture Study) By the late 1800s fur 
seals in New Zealand were nearly brought to extinction by Polynesian and European 
hunters. In recent years the abundance of fur seals in New Zealand has been increasing. 
This increase has been of great interest to scientists, and these animals have been 
studied extensively [61, 62, 405]. 

Our goal is to estimate the number of pups in a fur seal colony using a capture— 
recapture approach [585]. In such studies, separate repeated efforts are made to count a 
population of unknown size. In our case, the population to be counted is the population 
of pups. No single census attempt is likely to provide a complete enumeration of the 
population, nor is it even necessary to try to capture most of the individuals. The 
individuals captured during each census are released with a marker indicating their 
capture. A capture of a marked individual during any subsequent census is termed 
a recapture. Population size can be estimated on the basis of the history of capture 
and recapture data. High recapture rates suggest that the true population size does not 
greatly exceed the total number of unique individuals ever captured. 

Let $N$ be the unknown population size to be estimated using $I$ census attempts
yielding total numbers of captures (including recaptures) equaling $c = (c_1, \ldots, c_I)$.
We assume that the population is closed during the period of the sampling, which
means that deaths, births, and migrations are inconsequential during this period. The
total number of distinct animals captured during the study is denoted by $r$.

We consider a model with separate, unknown capture probabilities for each
census effort, $\alpha = (\alpha_1, \ldots, \alpha_I)$. This model assumes that all animals are equally
catchable on any one capture occasion, but capture probabilities may change over
time. The likelihood for this model is

$$L(N, \alpha \mid c, r) \propto \frac{N!}{(N - r)!} \prod_{i=1}^{I} \alpha_i^{c_i} (1 - \alpha_i)^{N - c_i}. \tag{7.11}$$

This model is sometimes called the M(t) model [55].

TABLE 7.1 Fur seal data for seven census efforts in one season.

                                      Census attempt, i
                                 1    2    3    4    5    6    7
Number captured, c_i            30   22   29   26   31   32   35
Number newly caught, m_i        30    8   17    7    9    8    5

In a capture-recapture study conducted on the Otago Peninsula on the South
Island of New Zealand, fur seal pups were marked and released during $I = 7$ census
attempts during one season. It is reasonable to assume the population of pups was
closed during the study period. Table 7.1 shows the number of pups captured ($c_i$) and
the number of these captures corresponding to pups never previously caught ($m_i$),
for census attempts $i = 1, \ldots, 7$. A total of $r = \sum_{i=1}^{7} m_i = 84$ unique fur seals were
observed during the sampling period.

For estimation, one might adopt a Bayesian framework where $N$ and $\alpha$ are
assumed to be a priori independent with the following priors. For the unknown
population size we use an improper uniform prior $f(N) \propto 1$. For the capture
probabilities, we use

$$f(\alpha_i \mid \theta_1, \theta_2) = \mathrm{Beta}(\theta_1, \theta_2) \tag{7.12}$$

for $i = 1, \ldots, 7$, and we assume these are a priori independent. If
$\theta_1 = \theta_2 = \tfrac{1}{2}$, this corresponds to the Jeffreys prior. The combination of a uniform
prior for $N$ and a Jeffreys prior for $\alpha_i$ is recommended when $I > 5$ [653]. This leads
to a proper posterior distribution for the parameters when $I > 2$ and there is at least
one recapture ($c_i - m_i \ge 1$). A Gibbs sampler can then be constructed by simulating
from the conditional posterior distributions

$$N^{(t+1)} - 84 \mid \cdot \sim \mathrm{NegBin}\left(85,\; 1 - \prod_{i=1}^{7} \big(1 - \alpha_i^{(t)}\big)\right), \tag{7.13}$$

$$\alpha_i^{(t+1)} \mid \cdot \sim \mathrm{Beta}\left(c_i + \tfrac{1}{2},\; N^{(t+1)} - c_i + \tfrac{1}{2}\right) \tag{7.14}$$

for $i = 1, \ldots, 7$. Here $\mid \cdot$ denotes conditioning on the parameters among
$\{N, \alpha, \theta_1, \theta_2\}$ as well as the data in Table 7.1, and NegBin denotes the negative
binomial distribution.

The results below are based on a chain of 100,000 iterations with the first
50,000 iterations discarded for burn-in. Diagnostics (see Example 7.10) do not indicate
any problems with convergence. To investigate whether the model produces sensible
results, one can compute the mean capture probability for each iteration and compare
it to the corresponding simulated population size. Figure 7.6 shows split boxplots of
$\bar{\alpha}^{(t)} = \frac{1}{7} \sum_{i=1}^{7} \alpha_i^{(t)}$ from (7.14) for each population size $N^{(t)}$ from (7.13).
As expected, the population size increases as the mean probability of capture decreases.
Figure 7.7 shows a histogram of the realizations of $N^{(t)}$ upon which posterior inference
about $N$ is based. The posterior mean of $N$ is 89.5 with a 95% highest posterior density
(HPD) interval of (84, 94). (A 95% HPD interval for $N$ is the region of shortest length
containing 95% of the posterior probability for $N$ for which the posterior density
for every point contained in the interval is never lower than the density for every
point outside the interval. See Section 7.3.3 for computational details for HPDs using
MCMC.) For comparison, the maximum likelihood estimate for $N$ is 88.5 and a 95%
nonparametric bootstrap confidence interval is (85.5, 97.3).

FIGURE 7.6 Split boxplots of $\bar{\alpha}$ against $N$ for the seal pup example.

The likelihood given in (7.11) is just one of the many forms of capture-recapture
models that could have been considered. For example, a model with a common capture
probability may be more appropriate. Other parameterizations of the problem might
also be investigated to improve MCMC convergence and mixing, which is strongly
dependent on the parameterization and updating of $(\theta_1, \theta_2)$. We consider these further
in Examples 7.7 and 7.10.
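
The Gibbs sampler of Example 7.6 alternates between the negative binomial update for $N$ and the Beta updates for the capture probabilities, using the data in Table 7.1 with $\theta_1 = \theta_2 = 1/2$. The sketch below is an illustration rather than the authors' implementation: the chain length, starting values, and names are assumptions, and it relies on NumPy's convention that `negative_binomial(n, p)` counts failures before `n` successes with success probability `p`, which is taken here to match the NegBin(85, $\cdot$) conditional implied by the uniform prior on $N$.

```python
import numpy as np

C = np.array([30, 22, 29, 26, 31, 32, 35])   # captures c_i from Table 7.1
R_UNIQUE = 84                                 # r, number of distinct pups observed

def seal_gibbs(n_iter=20_000, burn_in=10_000, seed=0):
    """Gibbs sampler for (N, alpha) with a uniform prior on N and Beta(1/2, 1/2) priors."""
    rng = np.random.default_rng(seed)
    alpha = C / 100.0                         # starting capture probabilities (assumption)
    N_draws = np.empty(n_iter, dtype=int)
    alpha_bar = np.empty(n_iter)
    for t in range(n_iter):
        # N - 84 | . ~ NegBin(85, 1 - prod(1 - alpha_i))
        p_caught = 1.0 - np.prod(1.0 - alpha)
        N = R_UNIQUE + rng.negative_binomial(R_UNIQUE + 1, p_caught)
        # alpha_i | . ~ Beta(c_i + 1/2, N - c_i + 1/2), drawn for all i at once
        alpha = rng.beta(C + 0.5, N - C + 0.5)
        N_draws[t] = N
        alpha_bar[t] = alpha.mean()           # mean capture probability for this iteration
    return N_draws[burn_in:], alpha_bar[burn_in:]

N_draws, alpha_bar = seal_gibbs()
print("posterior mean of N:", N_draws.mean())
print("correlation between N and mean capture probability:",
      np.corrcoef(N_draws, alpha_bar)[0, 1])
```

The printed posterior mean should fall in the general vicinity of the value reported in the text, and the correlation should be negative, but neither has been checked against the original analysis; a shorter chain than the 100,000 iterations in the text is used only to keep the sketch quick.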


7.2.2 Properties of the Gibbs Sampler 


Clearly the chain produced by a Gibbs sampler is Markov. Under rather mild
conditions, Geman and Geman [226] showed that the stationary distribution of the Gibbs
sampler chain is $f$. It also follows that the limiting marginal distribution of $X_i^{(t)}$ equals
the univariate marginalization of the target distribution along the $i$th coordinate. As
with the Metropolis-Hastings algorithm, we can use realizations from the chain to
estimate the expectation of any function of $X$.

FIGURE 7.7 Estimated marginal posterior probabilities for $N$ for the seal pup example.

It is possible to relate the Gibbs sampler to the Metropolis-Hastings algorithm,
allowing for a proposal distribution in the Metropolis-Hastings algorithm
that varies over time. Each Gibbs cycle consists of $p$ Metropolis-Hastings steps.
To see this, note that the $i$th Gibbs step in a cycle effectively proposes the candidate
vector $X^* = \big(x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, X_i^*, x_{i+1}^{(t)}, \ldots, x_p^{(t)}\big)$ given the current
state of the chain $\big(x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_i^{(t)}, \ldots, x_p^{(t)}\big)$. Thus, the $i$th
univariate Gibbs update can be viewed as a Metropolis-Hastings step drawing

$$X^* \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_i^{(t)}, \ldots, x_p^{(t)}
  \sim g_i\big(\cdot \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_i^{(t)}, \ldots, x_p^{(t)}\big),$$

where

$$g_i\big(x^* \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_i^{(t)}, \ldots, x_p^{(t)}\big) =
\begin{cases}
f\big(x_i^* \mid x_{-i}^{(t)}\big) & \text{if } x_{-i}^* = x_{-i}^{(t)}, \\
0 & \text{otherwise.}
\end{cases} \tag{7.15}$$

It is easy to show that in this case the Metropolis-Hastings ratio equals 1, which
means that the candidate is always accepted.
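
The claim that the ratio equals 1 can be verified in one line. In the worked step below, $x^{\mathrm{cur}}$ is shorthand (introduced here, not in the text) for the current state of the chain, which differs from the candidate $x^*$ only in the $i$th coordinate, and $f(x_{-i}^{(t)})$ denotes the marginal density of the remaining coordinates. Factoring each joint density as a conditional times this marginal gives:

```latex
\[
R = \frac{f(x^*)\; g_i\big(x^{\mathrm{cur}} \mid x^*\big)}
         {f(x^{\mathrm{cur}})\; g_i\big(x^* \mid x^{\mathrm{cur}}\big)}
  = \frac{f\big(x_i^* \mid x_{-i}^{(t)}\big)\, f\big(x_{-i}^{(t)}\big)\;
          f\big(x_i^{(t)} \mid x_{-i}^{(t)}\big)}
         {f\big(x_i^{(t)} \mid x_{-i}^{(t)}\big)\, f\big(x_{-i}^{(t)}\big)\;
          f\big(x_i^* \mid x_{-i}^{(t)}\big)}
  = 1 .
\]
```

Every factor in the numerator reappears in the denominator, so the acceptance probability $\min\{R, 1\}$ is 1 regardless of the value proposed.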

The Gibbs sampler should not be applied when the dimensionality of X changes 
(e.g., when moving between models with different numbers of parameters at each 
iteration of the Gibbs sampler). Section 8.2 gives methods for constructing a suitable 
Markov chain with the correct stationary distribution in this case. 

The "Gibbs sampler" is actually a generic name for a rich family of very adaptable
algorithms. In the following subsections we describe various strategies that have
been developed to improve the performance of the general algorithm described above.


7.2.3 Update Ordering


The ordering of updates made to the components of $X$ in the basic Gibbs sampler
(7.9) can change from one cycle to the next. This is called random scan Gibbs
sampling [417]. Randomly ordering each cycle can be effective when parameters are
highly correlated. For example, Roberts and Sahu [546] give asymptotic results for a
multilevel mixed model for which a random scan Gibbs sampling approach can yield
faster convergence rates than the deterministic update ordering given in (7.9). In
practice, without specialized knowledge for a particular model, we recommend trying
both deterministic and random scan Gibbs sampling when parameters are highly
correlated from one iteration to the next.
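
Switching between deterministic and random scan orderings only changes the order in which the univariate conditionals are visited within a cycle. The sketch below illustrates this on a toy target that is not from the text: a trivariate normal with unit variances and common correlation `RHO`, for which $X_i \mid x_{-i} \sim N\big(\tfrac{\rho}{1+\rho} \sum_{j \ne i} x_j,\; 1 - \tfrac{2\rho^2}{1+\rho}\big)$. The function names and the target are assumptions made for illustration.

```python
import numpy as np

RHO = 0.5                                              # common correlation of the toy target
COEF = RHO / (1.0 + RHO)                               # weight on the sum of the other two
COND_SD = np.sqrt(1.0 - 2.0 * RHO ** 2 / (1.0 + RHO))  # conditional standard deviation

def make_conditional(i):
    """Return a sampler for X_i | x_{-i} under the equicorrelated trivariate normal."""
    def draw(x, rng):
        others = x.sum() - x[i]
        return rng.normal(COEF * others, COND_SD)
    return draw

def gibbs(conditionals, x0, n_iter, random_scan=False, seed=0):
    """Gibbs sampler with deterministic ordering or a fresh random ordering each cycle."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    p = len(x)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        order = rng.permutation(p) if random_scan else np.arange(p)
        for i in order:
            x[i] = conditionals[i](x, rng)   # always uses the most recent other values
        draws[t] = x
    return draws

conditionals = [make_conditional(i) for i in range(3)]
fixed_scan = gibbs(conditionals, [0.0, 0.0, 0.0], 20_000)
random_scan = gibbs(conditionals, [0.0, 0.0, 0.0], 20_000, random_scan=True, seed=1)
print("deterministic scan correlations:\n", np.corrcoef(fixed_scan.T))
print("random scan correlations:\n", np.corrcoef(random_scan.T))
```

Both orderings target the same distribution; which one mixes faster is model dependent, which is why the text recommends trying both.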


7.2.4 Blocking 


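Another modification to the Gibbs sampler is called blocking or grouping. In the
Gibbs algorithm it is not necessary to treat each element of $X$ individually. In the
basic Gibbs sampler (7.9) with $p = 4$, for example, it would be allowable for each
cycle to proceed with the following sequence of updates:

$$\begin{aligned}
X_1^{(t+1)} \mid \cdot &\sim f\big(x_1 \mid x_2^{(t)}, x_3^{(t)}, x_4^{(t)}\big), \\
X_2^{(t+1)}, X_3^{(t+1)} \mid \cdot &\sim f\big(x_2, x_3 \mid x_1^{(t+1)}, x_4^{(t)}\big), \\
X_4^{(t+1)} \mid \cdot &\sim f\big(x_4 \mid x_1^{(t+1)}, x_2^{(t+1)}, x_3^{(t+1)}\big).
\end{aligned}$$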

In Example 7.4, we saw that the stream ecology parameters were naturally 
grouped into a conditionally multinomial set of parameters, a conditionally Dirichlet 
set of parameters, and a single conditionally Poisson element (7.10). It would be 
convenient and correct to cycle through these blocks, sequentially sampling from 
multivariate instead of univariate conditional distributions in the multinomial and 
Dirichlet cases. 

Blocking is typically useful when elements of X are correlated, with the algo- 
rithm constructed so that more correlated elements are sampled together in one block. 
Roberts and Sahu compare convergence rates for various blocking and update order- 
ing strategies [546]. The structured Markov chain Monte Carlo method of Sargent 
et al. offers a systematic approach to blocking that is directly motivated by the model 
structure [569]. This method has been shown to offer faster convergence for problems 
with a large number of parameters, such as Bayesian analyses for longitudinal and 
geostatistical data [110, 124]. 
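
For the stream ecology model of Example 7.4, the blocked scheme (7.10) can be sketched directly with multivariate draws. As before, $\lambda = 50$, $\alpha = (4, 6, 10)$, the starting values, and the function name are illustrative assumptions rather than values from the text.

```python
import numpy as np

def blocked_stream_gibbs(lam, alpha, n_iter=20_000, seed=0):
    """Blocked Gibbs sampler cycling through the three blocks in (7.10); returns (Y1, Y2, Y3)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    n = int(round(lam))                    # starting value for N (assumption)
    p = alpha / alpha.sum()                # starting class probabilities (assumption)
    counts = np.empty((n_iter, 3), dtype=int)
    for t in range(n_iter):
        y = rng.multinomial(n, p)                       # (Y1, Y2, Y3) | . block
        p = rng.dirichlet(y + alpha)                    # (P1, P2, P3) | . block
        n = y[0] + y[1] + rng.poisson(lam * p[2])       # N - y1 - y2 | . , with p3 = 1 - p1 - p2
        counts[t] = (y[0], y[1], n - y[0] - y[1])       # Y3 is implied by the updated N
    return counts

counts_blocked = blocked_stream_gibbs(lam=50.0, alpha=(4.0, 6.0, 10.0))
print("estimated E[Y1], E[Y2], E[Y3]:", counts_blocked.mean(axis=0))
```

Compared with cycling through five univariate conditionals, each cycle here needs only three draws, and the strongly dependent counts and probabilities are each updated jointly.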


7.2.5 Hybrid Gibbs Sampling 


For many problems the conditional distributions for one or more elements of $X$ are
not available in closed form. In this case, a hybrid MCMC algorithm can be developed
where at a given step in the Gibbs sampler, the Metropolis-Hastings algorithm is used
to sample from the appropriate conditional distribution. For example, for $p = 5$, a
hybrid MCMC algorithm might proceed with the following sequence of updates:

1. Update $X_1^{(t+1)} \mid \big(x_2^{(t)}, x_3^{(t)}, x_4^{(t)}, x_5^{(t)}\big)$ with a Gibbs step because this
conditional distribution is available in closed form.

2. Update $\big(X_2^{(t+1)}, X_3^{(t+1)}\big) \mid \big(x_1^{(t+1)}, x_4^{(t)}, x_5^{(t)}\big)$ with a
Metropolis-Hastings step because this joint conditional distribution is difficult to
sample from or is not available in closed form. Here, blocking $X_2$ and $X_3$ might be
recommended because these elements are highly correlated.

3. Update $X_4^{(t+1)} \mid \big(x_1^{(t+1)}, x_2^{(t+1)}, x_3^{(t+1)}, x_5^{(t)}\big)$ with a step from a
random walk chain because this conditional distribution is not available in closed form.

4. Update $X_5^{(t+1)} \mid \big(x_1^{(t+1)}, x_2^{(t+1)}, x_3^{(t+1)}, x_4^{(t+1)}\big)$ with a Gibbs step.


For both theoretical and practical reasons, only one Metropolis-Hastings step is
performed at each step in the hybrid Gibbs sampler. Indeed, it has been proven that the
basic Gibbs sampler in Section 7.2.1 is equivalent to the composition of $p$
Metropolis-Hastings algorithms with acceptance probabilities equal to 1 [543]. The term
"hybrid Gibbs" is rather generic terminology that is used to describe many different
algorithms (see Section 8.4 for more examples). The example shown in steps 1-4 above
is more precisely described as a "hybrid Gibbs sampler with Metropolis steps within
Gibbs," which is sometimes abbreviated as "Metropolis-within-Gibbs," and was first
proposed by [472].


Example 7.7 (Fur Seal Pup Capture-Recapture Study, Continued) Example 7.6
described the M(t) model in (7.11) for capture-recapture studies. For this model a
common practice is to assume a Beta prior distribution for the capture probabilities
and a noninformative Jeffreys prior for $N$, so $f(N) \propto 1/N$. For some datasets,
previous analyses have shown sensitivity to the values selected for $\theta_1$ and $\theta_2$ in (7.12)
[230]. To mitigate this sensitivity, we consider an alternative setup with a joint
distribution for $(\theta_1, \theta_2)$, namely $f(\theta_1, \theta_2) \propto \exp\{-(\theta_1 + \theta_2)/1000\}$ with
$(\theta_1, \theta_2)$ assumed to be a priori independent of the remaining parameters. A Gibbs
sampler can then be constructed by simulating from the conditional posterior
distributions

$$N - 84 \mid \cdot \sim \mathrm{NegBin}\left(84,\; 1 - \prod_{i=1}^{7} (1 - \alpha_i)\right), \tag{7.16}$$

$$\alpha_i \mid \cdot \sim \mathrm{Beta}(c_i + \theta_1,\; N - c_i + \theta_2) \quad \text{for } i = 1, \ldots, 7, \tag{7.17}$$

$$\theta_1, \theta_2 \mid \cdot \sim k \left[\frac{\Gamma(\theta_1 + \theta_2)}{\Gamma(\theta_1)\Gamma(\theta_2)}\right]^7
  \prod_{i=1}^{7} \alpha_i^{\theta_1} (1 - \alpha_i)^{\theta_2}
  \exp\left\{-\frac{\theta_1 + \theta_2}{1000}\right\}, \tag{7.18}$$

where $\mid \cdot$ denotes conditioning on the remaining parameters from
$\{N, \alpha, \theta_1, \theta_2\}$ as well as the data in Table 7.1 and $k$ is an unknown constant. Note
that (7.18) is not easy to sample. This suggests using a hybrid Gibbs sampler with a
Metropolis-Hastings step for (7.18). Thus the Gibbs sampler in (7.13)-(7.14) becomes
a hybrid Gibbs sampler in (7.16)-(7.18) when a prior distribution is used for $\theta_1$ and
$\theta_2$ instead of selecting values for these parameters.
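
One possible hybrid sampler for (7.16)-(7.18) is sketched below. The text does not specify the Metropolis-Hastings proposal for $(\theta_1, \theta_2)$, so the choice here is an assumption made for illustration: a random walk with uniform increments on $u = (\log\theta_1, \log\theta_2)$, which keeps both parameters positive and contributes the Jacobian factor $\theta_1\theta_2$ to the target in $u$-space. The step size, starting values, chain length, and names are likewise hypothetical.

```python
import math
import numpy as np

C = np.array([30, 22, 29, 26, 31, 32, 35])   # captures c_i from Table 7.1
R_UNIQUE = 84                                 # r, number of distinct pups observed

def log_cond_theta(theta, alpha):
    """Log of the conditional density (7.18), omitting the unknown constant k."""
    t1, t2 = theta
    log_norm = math.lgamma(t1 + t2) - math.lgamma(t1) - math.lgamma(t2)
    return (len(alpha) * log_norm
            + t1 * np.sum(np.log(alpha))
            + t2 * np.sum(np.log1p(-alpha))
            - (t1 + t2) / 1000.0)

def hybrid_gibbs(n_iter=5_000, step=0.3, seed=0):
    """Metropolis-within-Gibbs: Gibbs steps for N and alpha, one MH step for (theta1, theta2)."""
    rng = np.random.default_rng(seed)
    theta = np.array([1.0, 1.0])              # starting values (assumption)
    alpha = C / 100.0
    N_draws = np.empty(n_iter, dtype=int)
    theta_draws = np.empty((n_iter, 2))
    n_accept = 0
    for t in range(n_iter):
        # Gibbs step (7.16): N - 84 | . ~ NegBin(84, 1 - prod(1 - alpha_i))
        N = R_UNIQUE + rng.negative_binomial(R_UNIQUE, 1.0 - np.prod(1.0 - alpha))
        # Gibbs step (7.17): alpha_i | . ~ Beta(c_i + theta1, N - c_i + theta2)
        alpha = rng.beta(C + theta[0], N - C + theta[1])
        # Single MH step for (7.18): random walk on u = log(theta) (assumed proposal).
        u = np.log(theta)
        u_cand = u + rng.uniform(-step, step, size=2)
        theta_cand = np.exp(u_cand)
        # Target in u-space is (7.18) times the Jacobian theta1*theta2 = exp(u1 + u2).
        log_R = (log_cond_theta(theta_cand, alpha) + u_cand.sum()
                 - log_cond_theta(theta, alpha) - u.sum())
        if rng.uniform() < math.exp(min(0.0, log_R)):
            theta = theta_cand
            n_accept += 1
        N_draws[t] = N
        theta_draws[t] = theta
    return N_draws, theta_draws, n_accept / n_iter

N_hybrid, theta_hybrid, accept_rate = hybrid_gibbs()
print("MH acceptance rate for (theta1, theta2):", accept_rate)
print("mean of N over the run:", N_hybrid.mean())
```

Only one Metropolis-Hastings step is taken per cycle, in line with the discussion of hybrid Gibbs sampling above. No claim is made about how well this particular proposal mixes; sample paths of `theta_hybrid` should be inspected before any inference is drawn from such a run.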
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7.2.6 Griddy—Gibbs Sampler 


Hybrid methods such as embedding Metropolis—Hastings steps within a Gibbs algo- 
rithm are one way to construct a Gibbs-like chain when not all the univariate condi- 
tionals are easily sampled. Other strategies, evolved from techniques in Chapter 6, 
can be used to sample difficult univariate conditionals. 

One such method is the griddy–Gibbs sampler [541, 624]. Suppose that it is difficult
to sample from the univariate conditional density for $X_k \mid x_{-k}$ for a particular $k$.
To implement a griddy–Gibbs step, select some grid points $z_1, \ldots, z_n$ over the range
of support of $f(\cdot \mid x_{-k})$. Let $w_j^{(t)} = f(z_j \mid x_{-k}^{(t)})$ for $j = 1, \ldots, n$. Using these weights
and the corresponding grid, one can approximate the density function $f(\cdot \mid x_{-k})$ or,
equivalently, its inverse cumulative distribution function. Generate $X_k^{(t+1)} \mid x_{-k}^{(t)}$ from
this approximation, and proceed with the remainder of the MCMC algorithm. The approximation
to the $k$th univariate conditional can be refined as iterations proceed. The
simplest approach for the approximation and sampling step is to draw $X_k^{(t+1)} \mid x_{-k}^{(t)}$ from
the discrete distribution on $z_1, \ldots, z_n$ with probabilities proportional to $w_1^{(t)}, \ldots, w_n^{(t)}$
using the inverse cumulative distribution function method (Section 6.2.2). A piecewise
linear cumulative distribution function could be generated from an approximating
density function that is piecewise constant between the midpoints of any two
adjacent grid values with a density height set to ensure that the total probability on
the segment containing $z_j$ is proportional to $w_j^{(t)}$. Other approaches could be based on
the density estimation ideas presented in Chapter 10.
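
The simplest version of the griddy–Gibbs step, a draw from the discrete distribution on the grid by inverting its cumulative distribution function, can be sketched in a few lines. This is only an illustrative sketch: the function name `griddy_gibbs_draw`, the grid, and the normal-shaped conditional density used to exercise it are assumptions made for the example, not part of the method's description.

```python
import numpy as np

def griddy_gibbs_draw(cond_density, grid, rng):
    """Draw one value from a discrete approximation to a univariate conditional.

    cond_density: callable giving the (possibly unnormalized) conditional
                  density f(z | x_{-k}) at an array of grid points.
    grid:         1-D array of grid points z_1, ..., z_n.
    """
    grid = np.asarray(grid, dtype=float)
    weights = np.asarray(cond_density(grid), dtype=float)  # w_j = f(z_j | x_{-k})
    cdf = np.cumsum(weights) / np.sum(weights)             # discrete CDF on the grid
    u = rng.uniform()
    j = np.searchsorted(cdf, u, side="left")               # smallest j with cdf[j] >= u
    j = min(j, grid.size - 1)                              # guard against round-off
    return grid[j]

# Hypothetical conditional for illustration: an unnormalized N(1, 0.5^2) density.
def example_density(z):
    return np.exp(-0.5 * ((z - 1.0) / 0.5) ** 2)

rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 4.0, 201)
draws = np.array([griddy_gibbs_draw(example_density, grid, rng) for _ in range(5000)])
```

Within a full sampler, `cond_density` would be rebuilt at every iteration from the current values of the other components, since the weights depend on $x_{-k}^{(t)}$.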

If the approximation to $f(\cdot \mid x_{-k})$ is updated from time to time by improving the
grid, then the chain is not time homogeneous. In this case, reference to convergence
results for Metropolis–Hastings or Gibbs chains is not sufficient to guarantee that
a griddy–Gibbs chain has a limiting stationary distribution equal to $f$. One way to
ensure time homogeneity is to resist making any improvements to the approximating
univariate distribution as iterations progress. In this case, however, the limiting
distribution of the chain is still not correct because it relies on an approximation to
$f(\cdot \mid x_{-k})$ rather than the true density. This can be corrected by reverting to a hybrid
Metropolis-within-Gibbs framework where the variable generated from the approximation
to $f(\cdot \mid x_{-k})$ is viewed as a proposal, which is then randomly kept or discarded
based on the Metropolis–Hastings ratio. Tanner discusses a wide variety of potential
enhancements to the basic griddy–Gibbs strategy [624].


7.3 IMPLEMENTATION 


The goal of an MCMC analysis is to estimate features of the target distribution f. The 
reliability of such estimates depends on the extent to which sample averages computed 
using realizations of the chain correspond to their expectation under the limiting 
stationary distribution of the chain. Aside from griddy—Gibbs, all of the MCMC 
methods described above have the correct limiting stationary distribution. In practice, 
however, it is necessary to determine when the chain has run sufficiently long so that it 
is reasonable to believe that the output adequately represents the target distribution and 
can be used reliably for estimation. Unfortunately, MCMC methods can sometimes be
quite slow to converge, requiring extremely long runs, especially if the dimensionality 
of X is large. Further, it is often easy to be misled when using MCMC algorithm output 
to judge whether convergence has approximately been obtained. 

In this section, we examine questions about the long-run behavior of the chain. 
Has the chain run long enough? Is the first portion of the chain highly influenced 
by the starting value? Should the chain be run from several different starting values? 
Has the chain traversed all portions of the region of support of f? Are the sampled 
values approximate draws from f? How shall the chain output be used to produce 
estimates and assess their precision? Useful reviews of MCMC diagnostic methods 
include [76, 125, 364, 458, 543, 544]. We end with some practical advice for coding 
MCMC algorithms. 


7.3.1 Ensuring Good Mixing and Convergence 


It is important to consider how efficiently an MCMC algorithm provides useful infor- 
mation about a problem of interest. Efficiency can take on several meanings in this 
context, but here we will focus on how quickly the chain forgets its starting value and 
how quickly the chain fully explores the support of the target distribution. A related 
concern is how far apart in a sequence observations need to be before they can be 
considered to be approximately independent. These qualities can be described as the 
mixing properties of the chain. 

We must also be concerned whether the chain has approximately reached its 
stationary distribution. There is substantial overlap between the goals of diagnosing 
convergence to the stationary distribution and investigating the mixing properties 
of the chain. Many of the same diagnostics can be used to investigate both mixing 
and convergence. In addition, no diagnostic is fail-safe; some methods can suggest 
that a chain has approximately converged when it has not. For these reasons, we 
combine the discussion of mixing and convergence in the following subsections, and 
we recommend that a variety of diagnostic techniques be used. 


7.3.1.1 Simple Graphical Diagnostics After programming and running the 
MCMC algorithm from multiple starting points, users should perform various 
diagnostics to investigate the properties of the MCMC algorithm for the particular 
problem. Three simple diagnostics are discussed below. 

A sample path is a plot of the iteration number $t$ versus the realizations of $X^{(t)}$.
Sample paths are sometimes called trace or history plots. If a chain is mixing poorly, 
it will remain at or near the same value for many iterations, as in the lower panel in 
Figure 7.2. A chain that is mixing well will quickly move away from its starting value 
and the sample path will wiggle about vigorously in the region supported by f. 

The cumulative sum (cusum) diagnostic assesses the convergence of an estimator
of a one-dimensional parameter $\theta = E\{h(X)\}$ [678]. For $n$ realizations of
the chain after discarding some initial iterates, the estimator is given by
$\hat{\theta}_n = (1/n) \sum_{j=1}^{n} h(X^{(j)})$. The cusum diagnostic is a plot of
$\sum_{i=1}^{t} [h(X^{(i)}) - \hat{\theta}_n]$ versus $t$. If the final estimator will be computed
using only the iterations of the chain that remain after removing some burn-in values
(see Section 7.3.1.2), then the estimator and cusum plot should be based only on the
values to be used in the final estimator. Yu and Mykland [678] suggest that cusum plots
that are very wiggly and have smaller excursions from 0 indicate that the chain is
mixing well. Plots that have large excursions from 0 and are smoother suggest slower
mixing speeds. The cusum plot shares one drawback with many other convergence
diagnostics: For a multimodal distribution where the chain is stuck in one of the modes,
the cusum plot may appear to indicate good performance when, in fact, the chain is not
performing well.

FIGURE 7.8 Autocorrelation function plots for independence chain of Example 7.2 with
proposal densities Beta(1,1) (top) and Beta(2,10) (bottom).

An autocorrelation plot summarizes the correlation in the sequence of $X^{(t)}$ at
different iteration lags. The autocorrelation at lag i is the correlation between iterates 
that are i iterations apart [212]. A chain that has poor mixing properties will exhibit 
slow decay of the autocorrelation as the lag between iterations increases. For problems 
with more than one parameter it may also be of use to consider cross-correlations 
between parameters that might be related, since high cross-correlations may also 
indicate poor mixing of the chain. 
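
The quantities behind the cusum and autocorrelation plots are easy to compute directly. The sketch below is a minimal illustration and makes several assumptions: the helper names are ours, and two AR(1) series (coefficients 0.1 and 0.95) stand in for well-mixing and poorly mixing chain output, since no real MCMC output is available here.

```python
import numpy as np

def cusum_path(h_values):
    """Partial sums of h(x^(i)) - theta_hat for t = 1, ..., n."""
    h_values = np.asarray(h_values, dtype=float)
    return np.cumsum(h_values - h_values.mean())

def autocorrelation(x, max_lag):
    """Sample autocorrelations at lags 0, 1, ..., max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    return np.array([np.dot(centered[: x.size - k], centered[k:]) / denom
                     for k in range(max_lag + 1)])

def ar1(phi, n, rng):
    """Hypothetical stand-in for chain output: AR(1) with coefficient phi."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

rng = np.random.default_rng(1)
fast = ar1(0.1, 5000, rng)    # mimics a well-mixing chain
slow = ar1(0.95, 5000, rng)   # mimics a poorly mixing chain

acf_fast = autocorrelation(fast, 40)
acf_slow = autocorrelation(slow, 40)
cusum_fast = cusum_path(fast)
cusum_slow = cusum_path(slow)
```

Plotting `cusum_fast` and `cusum_slow` against the iteration index, and `acf_fast` and `acf_slow` against the lag, gives the two diagnostics; the slowly mixing series should show smoother, larger cusum excursions and a slowly decaying autocorrelation.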


Example 7.8 (Mixture Distribution, Continued) Figure 7.8 shows autocorrela- 
tion function (acf) plots for the independence chain described in Example 7.2. In the 
top panel, the more appropriate proposal distribution yields a chain for which the 
autocorrelations decrease rather quickly. In the lower panel, the bad proposal distri- 
bution yields a chain for which autocorrelations are very high, with a correlation of 
0.92 for observations that are 40 iterations apart. This panel clearly indicates poor 
mixing. 


7.3.1.2 Burn-in and Run Length Key considerations in the diagnosis of con- 
vergence are the burn-in period and run length. Recall that it is only in the limit that 
an MCMC algorithm yields $X^{(t)} \sim f$. For any implementation, the iterates will not
have exactly the correct marginal distribution, and the dependence on the initial point 
(or distribution) from which the chain was started may remain strong. To reduce the 
severity of this problem, the first D values from the chain are typically discarded as 
a burn-in period. 

A commonly used approach for the determination of an appropriate burn-in 
period and run length is that of Gelman and Rubin [221, 224]. This method is based 
on a statistic motivated by an analysis of variance (ANOVA): The burn-in period or
MCMC run-length should be increased if a between-chain variance is considerably 
larger than the within-chain variance. The variances are estimated based on the results 
of J runs of the MCMC algorithm to create separate, equal-length chains (J > 2) with 
starting values dispersed over the support of the target density. 

Let $L$ denote the length of each chain after discarding $D$ burn-in iterates. Suppose
that the variable (e.g., parameter) of interest is $X$, and its value at the $t$th iteration
of the $j$th chain is $x_j^{(t)}$. Thus, for the $j$th chain, the $D$ values $x_j^{(0)}, \ldots, x_j^{(D-1)}$ are
discarded and the $L$ values $x_j^{(D)}, \ldots, x_j^{(D+L-1)}$ are retained. Let

$$\bar{x}_j = \frac{1}{L} \sum_{t=D}^{D+L-1} x_j^{(t)} \quad \text{and} \quad \bar{x}_{\cdot} = \frac{1}{J} \sum_{j=1}^{J} \bar{x}_j, \tag{7.19}$$

and define the between-chain variance as

$$B = \frac{L}{J-1} \sum_{j=1}^{J} \left(\bar{x}_j - \bar{x}_{\cdot}\right)^2. \tag{7.20}$$

Next define

$$s_j^2 = \frac{1}{L-1} \sum_{t=D}^{D+L-1} \left(x_j^{(t)} - \bar{x}_j\right)^2$$

to be the within-chain variance for the $j$th chain. Then let

$$W = \frac{1}{J} \sum_{j=1}^{J} s_j^2 \tag{7.21}$$

represent the mean of the $J$ within-chain estimated variances. Finally, let

$$R = \frac{[(L-1)/L]\,W + (1/L)\,B}{W}. \tag{7.22}$$

If all the chains are stationary, then both the numerator and the denominator should
estimate the marginal variance of $X$. If, however, there are notable differences between
the chains, then the numerator will exceed the denominator.
In theory, $\sqrt{R} \to 1$ as $L \to \infty$. In practice, the numerator in (7.22) is slightly
too large and the denominator is slightly too small. An adjusted estimator is given by

$$\hat{R} = \frac{J+1}{J} R - \frac{L-1}{JL}.$$

Some authors suggest that $\sqrt{\hat{R}} < 1.1$ indicates that the burn-in and chain length are
sufficient [544]. Another useful convergence diagnostic is a plot of the values of $\hat{R}$
versus the number of iterations. When $\hat{R}$ has not stabilized near 1, this suggests lack
of convergence. If the chosen burn-in period did not yield an acceptable result, then $D$
should be increased, $L$ should be increased, or preferably both. A conservative choice
is to use one-half of the iterations for burn-in. The performance of this diagnostic is
improved if the iterates $x_j^{(t)}$ are transformed so that their distribution is approximately
normal. Alternatively, a reparameterization of the model could be undertaken and the
chain rerun.
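
The Gelman–Rubin computation in (7.19)–(7.22), together with the adjusted estimator, can be sketched as follows. The function name `gelman_rubin` and the simulated inputs are assumptions for illustration: four independent normal sequences stand in for well-mixed chains, and the same sequences with shifted means stand in for chains that have not yet found a common distribution.

```python
import numpy as np

def gelman_rubin(chains):
    """Return (R, R_adj) for one scalar quantity.

    chains: array of shape (J, L) holding the L post-burn-in values
            from each of J chains.
    """
    chains = np.asarray(chains, dtype=float)
    J, L = chains.shape
    chain_means = chains.mean(axis=1)
    B = L * chain_means.var(ddof=1)           # between-chain variance, (7.20)
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance, (7.21)
    R = ((L - 1) / L * W + B / L) / W         # (7.22)
    R_adj = (J + 1) / J * R - (L - 1) / (J * L)
    return R, R_adj

rng = np.random.default_rng(3)
good = rng.normal(size=(4, 2000))                  # chains sharing one distribution
bad = good + 2.0 * np.arange(4)[:, None]           # chains centered at 0, 2, 4, 6

R_good, R_adj_good = gelman_rubin(good)
R_bad, R_adj_bad = gelman_rubin(bad)
```

Under these assumptions, the square root of `R_adj_good` should sit close to 1, while that of `R_adj_bad` should be far above the 1.1 guideline.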

There are several potential difficulties with this approach. Selecting suitable 
starting values in cases of multimodal f may be difficult, and the procedure will not 
work if all of the chains become stuck in the same subregion or mode. Due to its uni- 
dimensionality, the method may also give a misleading impression of convergence for 
multidimensional target distributions. Enhancements of the Gelman—Rubin statistic 
are described in [71, 224], including an improved estimate of R in (7.22) that accounts 
for variability in unknown parameters. In practice, these improvements lead to very 
similar results. An extension for multidimensional target distributions is given in [71]. 

Raftery and Lewis [526] proposed a very different quantitative strategy for 
estimating run length and burn-in period. Some researchers advocate no burn-in [231]. 


7.3.1.3 Choice of Proposal As illustrated in Example 7.2, mixing is strongly 
affected by features of the proposal distribution, especially its spread. Further, advice 
on desirable features of a proposal distribution depends on the type of MCMC algo- 
rithm employed. 

For a general Metropolis—Hastings chain such as an independence chain, it 
seems intuitively clear that we wish the proposal distribution g to approximate the 
target distribution f very well, which in turn suggests that a very high rate of accepting 
proposals is desirable. Although we would like g to resemble f, the tail behavior of g 
is more important than its resemblance to f in regions of high density. In particular, if 
$f/g$ is bounded, the convergence of the Markov chain to its stationary distribution is
faster overall [543]. Thus, it is wiser to aim for a proposal distribution that is somewhat 
more diffuse than f. 

In practice, the variance of the proposal distribution can be selected through an 
informal iterative process. Start a chain, and monitor the proportion of proposals that 
have been accepted; then adjust the spread of the proposal distribution accordingly. 
After some predetermined acceptance rate is achieved, restart the chain using the 
appropriately scaled proposal distribution. For a Metropolis algorithm with normal 
target and proposal distributions, it has been suggested that an acceptance rate of 
between 25 and 50% should be preferred, with the best choice being about 44% for 
one-dimensional problems and decreasing to about 23.4% for higher-dimensional 
problems [545, 549]. To apply such rules, care must be taken to ensure that the
target and proposal distributions are roughly normally distributed or at least simple,
unimodal distributions. If, for example, the target distribution is multimodal, the chain
may get stuck in one mode without adequate exploration of the other portions of the
parameter space. In this case the acceptance rate may be very high, but the probability
of jumping from one mode to another may be low. This suggests one difficult issue
with most MCMC methods: it is useful to have as much knowledge as possible about
the target distribution, even though that distribution is typically unknown.
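
The informal tuning process described above (run a pilot chain, check the acceptance rate, rescale the proposal, then restart) can be sketched for a one-dimensional random walk Metropolis algorithm. Everything specific here is an assumption for illustration: the standard normal target, the starting proposal standard deviation of 0.1, the pilot length, and the multiplicative update rule `proposal_sd *= rate / target_rate` are our choices, not a prescribed procedure.

```python
import numpy as np

def rw_metropolis(log_target, x0, proposal_sd, n_iter, rng):
    """Random walk Metropolis with normal increments; returns (chain, acceptance rate)."""
    chain = np.empty(n_iter)
    x, log_fx = x0, log_target(x0)
    steps = rng.normal(0.0, proposal_sd, size=n_iter)
    log_u = np.log(rng.uniform(size=n_iter))
    accepted = 0
    for t in range(n_iter):
        x_star = x + steps[t]
        log_f_star = log_target(x_star)
        if log_u[t] < log_f_star - log_fx:     # accept with prob min(1, f*/f)
            x, log_fx = x_star, log_f_star
            accepted += 1
        chain[t] = x
    return chain, accepted / n_iter

def log_std_normal(x):
    """Log density of the hypothetical N(0, 1) target, up to a constant."""
    return -0.5 * x * x

rng = np.random.default_rng(2)
target_rate = 0.44          # suggested rate for one-dimensional problems
proposal_sd = 0.1           # deliberately too small to start

# Pilot runs: widen the proposal when too many moves are accepted, shrink otherwise.
for _ in range(10):
    _, rate = rw_metropolis(log_std_normal, 0.0, proposal_sd, 5000, rng)
    proposal_sd *= rate / target_rate

# Restart the chain with the tuned proposal for the run used in inference.
chain, final_rate = rw_metropolis(log_std_normal, 0.0, proposal_sd, 20000, rng)
```

The pilot output is discarded and only the restarted chain is kept, which mirrors the stop, tune, and restart cycle and keeps the final chain time homogeneous.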

Methods for adaptive Markov chain Monte Carlo (Section 8.1) tune the proposal 
distribution in the Metropolis algorithm during the MCMC algorithm. These methods 
have the advantage that they are automatic and, in some implementations, do not 
require the user to stop, tune, and restart the algorithm on multiple occasions. 


7.3.1.4 Reparameterization Model reparameterization can provide substantial 
improvements in the mixing behavior of MCMC algorithms. For a Gibbs sampler, 
performance is enhanced when components of X are as independent as possible. 
Reparameterization is the primary strategy for reducing dependence. For example, 
if f is a bivariate normal distribution with very strong positive correlation, both 
univariate conditionals will allow only small steps away from X = x along one 
axis. Therefore, the Gibbs sampler will explore f very slowly. However, suppose 
$Y = (X_1 + X_2, X_1 - X_2)$. This transformation yields one univariate conditional on
the axis of maximal variation in X and the second on an orthogonal axis. If we view 
the support of f as cigar shaped, then the univariate conditionals for Y allow one 
step along the length of the cigar, followed by one across its width. Therefore, the 
parameterization inherent in Y makes it far easier to move from one point supported 
by the target distribution to any other point in a single move (or a few moves). 
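
The effect of this reparameterization can be seen in a small simulation. The sketch below assumes a bivariate normal target with unit variances and correlation 0.99 (our choice of values). In that special case $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$ are independent, so each univariate conditional for $Y$ equals its marginal and the Gibbs cycle reduces to independent draws; the function names are hypothetical.

```python
import numpy as np

def gibbs_original(rho, n_iter, rng):
    """Gibbs sampler on (X1, X2) for a standard bivariate normal with correlation rho."""
    x = np.empty((n_iter, 2))
    x1, x2 = 0.0, 0.0
    cond_sd = np.sqrt(1.0 - rho ** 2)
    for t in range(n_iter):
        x1 = rng.normal(rho * x2, cond_sd)   # X1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, cond_sd)   # X2 | x1 ~ N(rho * x1, 1 - rho^2)
        x[t] = (x1, x2)
    return x

def gibbs_reparameterized(rho, n_iter, rng):
    """Gibbs sampler on Y = (X1 + X2, X1 - X2), mapped back to (X1, X2).

    With unit variances, Y1 ~ N(0, 2(1 + rho)) and Y2 ~ N(0, 2(1 - rho)) are
    independent, so each conditional draw does not depend on the other component.
    """
    y1 = rng.normal(0.0, np.sqrt(2.0 * (1.0 + rho)), size=n_iter)
    y2 = rng.normal(0.0, np.sqrt(2.0 * (1.0 - rho)), size=n_iter)
    return np.column_stack(((y1 + y2) / 2.0, (y1 - y2) / 2.0))

def lag1_autocorrelation(x):
    c = x - x.mean()
    return np.dot(c[:-1], c[1:]) / np.dot(c, c)

rng = np.random.default_rng(4)
x_orig = gibbs_original(0.99, 5000, rng)
x_rep = gibbs_reparameterized(0.99, 5000, rng)

lag1_orig = lag1_autocorrelation(x_orig[:, 0])
lag1_rep = lag1_autocorrelation(x_rep[:, 0])
```

Under these assumptions `lag1_orig` should be near $0.99^2$, reflecting the small steps along each axis, while `lag1_rep` should be near zero even though both samplers target the same distribution.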

Different models require different reparameterization strategies. For example, 
if there are continuous covariates in a linear model, it is useful to center and scale the 
covariates to reduce correlations between the parameters in the model. For Bayesian 
treatment of linear models with random effects, hierarchical centering can be used to 
accelerate MCMC convergence [218, 219]. The term hierarchical centering comes 
from the idea that the parameters are centered as opposed to centering the covariates. 
Hierarchical centering involves reexpressing a linear model into another form that 
produces different conditional distributions for the Gibbs sampler. 


Example 7.9 (Hierarchical Centered Random Effects Model) Consider a study
of pollutant levels where it is known that tests performed at different
laboratories have different levels of measurement error. Let $y_{ij}$ be the pollutant level
of the $j$th sample that was tested at the $i$th laboratory. We might consider a simple
random effects model

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij} \tag{7.23}$$

where $i = 1, \ldots, I$ and $j = 1, \ldots, n_i$. In the Bayesian paradigm, we might assume
$\mu \sim N(\mu_0, \sigma_\mu^2)$, $\alpha_i \sim N(0, \sigma_\alpha^2)$, and $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$. The hierarchical centered form
of (7.23) is a simple reparameterization of the model with $y_{ij} = \gamma_i + \epsilon_{ij}$ where
$\gamma_i = \mu + \alpha_i$ and $\gamma_i \mid \mu \sim N(\mu, \sigma_\alpha^2)$. Thus $\gamma$ is centered about $\mu$. Hierarchical centering
usually produces better behaved MCMC chains when $\sigma_\epsilon^2$ is not a lot larger
than $\sigma_\alpha^2$, which is likely when random effects are deemed useful for modeling a given
dataset. While this is a simple example, hierarchical centering has been shown to
produce more efficient MCMC algorithms for more complex linear model problems
such as generalized linear mixed models. However, the advantages of hierarchical
centering can depend on the problem at hand and should be implemented on a case-by-case
basis [77, 219]. See Problems 7.7 and 7.8 for another example of hierarchical
centering.


Unfortunately, reparameterization approaches are typically adapted for specific 
models, so it is difficult to provide generic advice. Another way to improve mixing 
and accelerate convergence of MCMC algorithms is to augment the problem us- 
ing so-called auxiliary variables; see Chapter 8. A variety of reparameterization and 
acceleration techniques are described in [106, 225, 242, 543]. 


7.3.1.5 Comparing Chains: Effective Sample Size If MCMC realizations 
are highly correlated, then the information gained from each iteration of the MCMC 
algorithm will be much less than suggested by the run length. The reduced information 
is equivalent to that contained in a smaller i.i.d. sample whose size is called the effective 
sample size. The difference between the total number of samples and the effective 
sample size indicates the efficiency lost when correlated samples from the Markov 
chain have been used to estimate a quantity of interest instead of an independent and 
identically distributed sample with the same variance as the observed sample [543]. 

To estimate the effective sample size, the first step is to compute the estimated
autocorrelation time, a summary measure of the autocorrelation between realizations
and their rate of decay. The autocorrelation time is given by

$$\tau = 1 + 2 \sum_{k=1}^{\infty} \rho(k), \tag{7.24}$$

where $\rho(k)$ is the autocorrelation between realizations that are $k$ iterations apart (e.g.,
the correlation between $X^{(t)}$ and $X^{(t+k)}$ for $t = 1, \ldots, L$). Accurate estimation of $\rho(k)$
presents its own challenges, but a common approach is to truncate the summation when
$\rho(k) < 0.1$ [110]. Then the effective sample size for an MCMC run with $L$ iterations
after burn-in can be estimated using $L/\hat{\tau}$, where $\hat{\tau}$ is the estimated autocorrelation time.
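
A direct implementation of the truncated autocorrelation time and the resulting effective sample size might look like the sketch below. The function name, the use of the ordinary sample autocorrelation as the estimate of $\rho(k)$, and the AR(1) test series are assumptions for illustration.

```python
import numpy as np

def effective_sample_size(x, cutoff=0.1):
    """Return (ess, tau) using (7.24), truncated at the first lag with rho(k) < cutoff."""
    x = np.asarray(x, dtype=float)
    L = x.size
    centered = x - x.mean()
    denom = np.dot(centered, centered)
    tau = 1.0
    for k in range(1, L):
        rho_k = np.dot(centered[: L - k], centered[k:]) / denom
        if rho_k < cutoff:          # truncate the summation here
            break
        tau += 2.0 * rho_k
    return L / tau, tau

rng = np.random.default_rng(5)
L = 20000

# Hypothetical correlated output: AR(1) with coefficient 0.9.
ar = np.empty(L)
ar[0] = rng.normal()
for t in range(1, L):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

ess_ar, tau_ar = effective_sample_size(ar)
ess_iid, tau_iid = effective_sample_size(rng.normal(size=L))
```

For the correlated series the estimated autocorrelation time is well above 1 and the effective sample size is a small fraction of `L`; for the independent series the sum truncates immediately and the effective sample size equals `L`.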

Effective sample size can be used to compare the efficiency of competing 
MCMC samplers for a given problem. For a fixed number of iterations, an MCMC 
algorithm with a larger effective sample size is likely to converge more quickly. For 
example, we may be interested in the gains achieved from blocking in a Gibbs sam- 
pler. If the blocked Gibbs sampler has a much higher effective sample size than the 
unblocked version, this suggests that the blocking has improved the efficiency of the 
MCMC algorithm. Effective sample size can also be used for a single chain. For example,
consider a Bayesian model with two parameters $(\alpha, \beta)$ and an MCMC algorithm
run for 10,000 iterations after burn-in. An effective sample size of, say, 9500 iterations
for $\alpha$ suggests low correlations between iterations. In contrast, if the results indicated
an effective sample size of 500 iterations for $\beta$, this would suggest that convergence
for $\beta$ is highly suspect.


7.3.1.6 Number of Chains One of the most difficult problems to diagnose 
is whether or not the chain has become stuck in one or more modes of the target 
distribution. In this case, all convergence diagnostics may indicate that the chain has 
converged, though the chain does not fully represent the target distribution. A partial 
solution to this problem is to run multiple chains from diverse starting values and 
then compare the within- and between-chain behavior. A formal approach for doing 
this is described in Section 7.3.1.2. 

The general notion of running multiple chains to study between-chain perfor- 
mance is surprisingly contentious. One of the most vigorous debates during the early 
statistical development of MCMC methods centered around whether it was more im- 
portant to invest limited computing time in lengthening the run of a single chain, or 
in running several shorter chains from diverse starting points to check performance 
[224, 233, 458]. The motivation for trying multiple runs is the hope that all interest- 
ing features (e.g., modes) of the target distribution will be explored by at least one 
chain, and that the failure of individual chains to find such features or to wash out 
the influence of their starting values can be detected, in which case chains must be 
lengthened or the problem reparameterized to encourage better mixing. 

Arguments for one long chain include the following. First, many short runs are more
informative than one long run only when they indicate poor convergence behavior. In 
this case, the simulated values from the many short chains remain unusable. Second, 
the effectiveness of using many short runs to diagnose poor convergence is mainly 
limited to unrealistically simple problems or problems where the features of f are 
already well understood. Third, splitting computing effort into many short runs may 
yield an indication of poor convergence that would not have occurred if the total 
computing effort had been devoted to one longer run. 

We do not find the single-chain arguments entirely convincing from a practical 
point of view. Starting a number of shorter chains from diverse starting points is an 
essential component of thorough debugging of computer code. Some primary fea- 
tures of f (e.g., multimodality, highly constrained support region) are often broadly 
known—even in complex realistic problems—notwithstanding uncertainty about spe- 
cific details of these features. Results from diverse starts can also provide information 
about key features of f, which in turn helps determine whether the MCMC method and 
problem parameterization are suitable. Poor convergence of several short chains can 
help determine what aspects of chain performance will be most important to monitor 
when a longer run is made. Finally, CPU cycles are more abundant and less expensive 
than they were a decade ago. We can have diverse short runs and one longer run. 
Exploratory work can be carried out using several shorter chains started from various 
points covering the believed support of $f$. Diagnosis of chain behavior can be made
using a variety of the informal and formal techniques described in
this chapter. After building confidence that the implementation is a promising one, it
is advisable to run one final very long run from a good starting point to calculate and 
publish results. 
7.3.2 Practical Implementation Advice


The discussion above raises the question of what values should be used for the number 
of chains, the number of iterations for burn-in, and the length of the chain after 
burn-in. Most authors are reluctant to recommend generic values because appropriate 
choices are highly dependent on the problem at hand and the rate and efficiency with 
which the chain explores the region supported by f. Similarly, the choices are limited 
by how much computing time is available. Published analyses have used burn-ins 
from zero to tens of thousands and chain lengths from the thousands to the millions. 
Diagnostics usually rely on at least three, and typically more, chains. As
computing power continues to grow, so too will the scope and intensity of MCMC 
efforts. 

In summary, we reiterate our advice from Section 7.3.1.6 here, which in turn 
echoes [126]. First, create multiple trial runs of the chain from diverse starting values. 
Next, carry out a suite of diagnostic procedures like those discussed above to ensure
that the chain appears to be mixing well and has approximately converged to the
stationary distribution. Then, restart the chain for a final long run using a new seed to
initiate the sampling. A popular, though conservative, choice is to discard
the first half of the MCMC iterations as burn-in. When each MCMC iteration
is computationally expensive, users typically select much shorter burn-in lengths that 
conserve more iterations for inference. 

For learning about MCMC methods and chain behavior, nothing beats pro- 
gramming these algorithms from scratch. For easier implementation, various software 
packages have been developed to automate the development of MCMC algorithms 
and the related diagnostics. The most comprehensive software to date is the BUGS 
(Bayesian inference Using Gibbs Sampling) software family with developments for 
several platforms [610]. A popular application mode is to use BUGS within the R sta- 
tistical package [626]. Packages in R like CODA [511] and BOA [607] allow users to 
easily construct the relevant convergence diagnostics. Most of this software is freely 
available via the Internet. 


7.3.3 Using the Results 


We describe here some of the common summaries of MCMC algorithm output and 
continue the fur seal pup example for further illustration. 

The first topic to consider is marginalization. If $\{X^{(t)}\}$ represents a
$p$-dimensional Markov chain, then $\{X_i^{(t)}\}$ is a Markov chain whose limiting distribution
is the $i$th marginal of $f$. If you are focused only on a property of this marginal,
discard the rest of the simulation and analyze the realizations of $X_i^{(t)}$. Further, note
that it is not necessary to run a chain for every quantity of interest. Post hoc inference
about any quantity can be obtained from the realizations of $X^{(t)}$ generated by the
chain. In particular, the probability for any event can be estimated by the frequency
of that event in the chain.

Standard one-number summary statistics such as means and variances are com- 
monly desired (see Section 7.1). The most commonly used estimator is based on an 
empirical average. Discard the burn-in; then calculate the desired statistic by taking

$$\frac{1}{L} \sum_{t=D}^{D+L-1} h\left(X^{(t)}\right) \tag{7.25}$$

as the estimator of $E\{h(X)\}$, where $L$ denotes the length of each chain after discarding
$D$ burn-in iterates. This estimator is consistent even though the $X^{(t)}$ are serially
correlated. There are asymptotic arguments in favor of using no burn-in (so $D = 1$)
[231]. However, since a finite number of iterations is used to compute the estimator 
in (7.25), most researchers employ a burn-in to reduce the influence of the initial 
iterates sampled from a distribution that may be far from the target distribution. We 
recommend using a burn-in period. 

Other estimators have been developed. The Riemann sum estimator in (6.86) has 
been shown to have faster convergence than the standard estimator given above. Other 
variance reduction techniques discussed in Section 6.4, such as Rao—Blackwellization, 
can also be used to reduce the Monte Carlo variability of estimators based on the chain 
output [507]. 

The Monte Carlo, or simulation, standard error of an estimator is also of in- 
terest. This is an estimate of the variability in the estimator if the MCMC algorithm 
were to be run repeatedly. The naive estimate of the standard error for an estimator 
like (7.25) is the sample standard deviation of the L realizations after burn-in divided 
by $\sqrt{L}$. However, MCMC realizations are typically positively correlated, so this can
underestimate the standard error. An obvious correction is to compute the standard er- 
ror based on a systematic subsample of, say, every kth iterate after burn-in. However, 
this approach is inefficient [429]. A simple estimator of the standard error is the batch 
method [92, 324]. Separate the L iterates into batches with b consecutive iterations 
in each batch. Compute the mean of each batch. Then the estimated standard error 
is the standard deviation of these means divided by the square root of the number of 
batches. A recommended batch size is $b = \lfloor L^{1/a} \rfloor$ where $a = 2$ or 3 and $\lfloor z \rfloor$ denotes
the largest integer not greater than $z$ [355]. Other strategies to estimate Monte Carlo standard
errors are surveyed in [196, 233, 609]. The Monte Carlo standard error can be used 
to assess the between-simulation variability. It has been suggested that, after deter- 
mining that the chain has good mixing and convergence behavior, you should run the 
chain until the Monte Carlo simulation error is less than 5% of the standard deviation 
for all parameters of interest [610]. 
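
The batch method is short enough to write out in full. The sketch below follows the description above with batch size $b = \lfloor L^{1/a} \rfloor$; the function names, the decision to drop any leftover iterates that do not fill a final batch, and the AR(1) series used as stand-in chain output are our assumptions.

```python
import numpy as np

def naive_se(h_values):
    """Standard error that ignores serial correlation."""
    h_values = np.asarray(h_values, dtype=float)
    return h_values.std(ddof=1) / np.sqrt(h_values.size)

def batch_means_se(h_values, a=2):
    """Batch-means estimate of the Monte Carlo standard error of the sample mean."""
    h_values = np.asarray(h_values, dtype=float)
    L = h_values.size
    b = int(round(L ** (1.0 / a)))       # batch size floor(L^(1/a)) ...
    if b ** a > L:                       # ... computed robustly to round-off
        b -= 1
    n_batches = L // b
    trimmed = h_values[: n_batches * b]  # drop iterates that do not fill a batch
    means = trimmed.reshape(n_batches, b).mean(axis=1)
    return means.std(ddof=1) / np.sqrt(n_batches)

# Hypothetical positively correlated output: AR(1) with coefficient 0.9.
rng = np.random.default_rng(6)
L = 10000
x = np.empty(L)
x[0] = rng.normal()
for t in range(1, L):
    x[t] = 0.9 * x[t - 1] + rng.normal()

se_naive = naive_se(x)
se_batch = batch_means_se(x, a=2)
```

For this positively correlated series the naive value is several times smaller than the batch-means value, which illustrates how the naive estimate can understate the simulation error.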

Quantile estimates and other interval estimates are also commonly desired. 
Estimates of various quantiles such as the median or the fifth percentile of h(X) can 
be computed using the corresponding percentile of the realizations of the chain. This 
is simply implementing (7.25) for tail probabilities and inverting the relationship to 
find the quantile. 

For Bayesian analyses, computation of the highest posterior density (HPD)
interval is often of interest (see Section 1.5). For a unimodal and symmetric posterior
distribution, the $100(1 - \alpha)\%$ HPD interval is given by the $100(\alpha/2)$th and $100(1 - \alpha/2)$th
percentiles of the iterates. For a unimodal posterior distribution, an MCMC approximation
of the HPD interval can be computed as follows. For the parameter
of interest, sort the MCMC realizations after burn-in, $x^{(D)}, \ldots, x^{(D+L-1)}$, to obtain
$x_{(1)} \le x_{(2)} \le \cdots \le x_{(L-1)}$. Compute the $100(1 - \alpha)\%$ credible intervals

$$I_j = \left(x_{(j)},\; x_{(j + \lfloor (1-\alpha)(L-1) \rfloor)}\right) \quad \text{for } j = 1, 2, \ldots, (L-1) - \lfloor (1-\alpha)(L-1) \rfloor,$$

where $\lfloor z \rfloor$ represents the largest integer not greater than $z$. The $100(1 - \alpha)\%$ HPD
interval is the interval $I_{j^*}$ with the shortest interval width among all credible intervals
[107]. More sophisticated alternatives for HPD computation for multimodal posterior
densities and other complexities are given in [106].
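
The shortest-interval search for a unimodal posterior can be sketched as below. The function name `hpd_interval` is ours, the code uses zero-based indexing over all sorted values (so the index range differs slightly from the notation above), and draws from a Gamma(2, 1) distribution stand in for post-burn-in MCMC output of a skewed posterior.

```python
import numpy as np

def hpd_interval(samples, alpha=0.05):
    """Shortest interval spanning floor((1 - alpha)(n - 1)) order-statistic steps."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    span = int(np.floor((1.0 - alpha) * (n - 1)))  # index offset between endpoints
    lower = x[: n - span]                          # candidate left endpoints
    upper = x[span:]                               # matching right endpoints
    j = np.argmin(upper - lower)                   # shortest candidate interval
    return lower[j], upper[j]

# Hypothetical skewed "posterior" sample.
rng = np.random.default_rng(7)
samples = rng.gamma(shape=2.0, scale=1.0, size=20000)

hpd_lo, hpd_hi = hpd_interval(samples, alpha=0.05)
eq_lo, eq_hi = np.percentile(samples, [2.5, 97.5])   # equal-tailed interval
coverage = np.mean((samples >= hpd_lo) & (samples <= hpd_hi))
```

For a skewed sample like this one, the interval `(hpd_lo, hpd_hi)` is shorter than the equal-tailed interval `(eq_lo, eq_hi)` while covering about the same fraction of draws; for a symmetric posterior the two would nearly coincide.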

Simple graphical summaries of MCMC output should not be overlooked. 
Histograms of the realizations of h(X) for any h of interest are standard prac- 
tice. Alternatively, one can apply one of the density estimation techniques from 
Chapter 10 to summarize the collection of values. It is also common practice to 
investigate pairwise scatterplots and other descriptive plots to explore or illustrate 
key features of f. 


Example 7.10 (Fur Seal Pup Capture-Recapture Study, Continued) Recall the 
fur seal pup capture—recapture study in Example 7.6 that led to the Gibbs sampler 
summarized in (7.13) and (7.14). A hybrid Gibbs sampler for this problem is consid- 
ered in Example 7.7. When applied to the fur seal data these MCMC algorithms have 
very different performance. We will consider these two variations to demonstrate the 
MCMC diagnostics described above. 

For the basic Gibbs sampler in Example 7.6, the sample path and autocorrelation 
plots do not indicate any lack of convergence (Figure 7.9). Based on five runs of 
100,000 iterations each with a burn-in of 50,000, the Gelman—Rubin statistic for N is 
equal to 0.999995, which suggests the N chain is roughly stationary. The effective 
sample size was 45,206 samples (iterations). Similarly, for the hybrid sampler in 
Example 7.7 there is no evidence of lack of convergence for N, so we will not consider 
this parameter further. 

In contrast to the speedy convergence for $N$, MCMC convergence behavior for
the capture probability parameters $(\alpha_1, \ldots, \alpha_7)$ varies with the form of the model and
Gibbs sampling strategy. For the uniform/Jeffreys prior combination and basic Gibbs
sampler in Example 7.6, the Gelman–Rubin statistics for the capture probabilities are
all close to one, and the capture probabilities exhibit little correlation between MCMC
samples (e.g., lower right panel of Figure 7.9). This suggests that the chains are roughly
stationary. However, as we will show below, the alternative prior distributions and the
hybrid Gibbs sampler described in Example 7.7 lead to less satisfactory MCMC
convergence behavior.

FIGURE 7.9 Output from the basic Gibbs sampler from Example 7.6. Top row: sample paths
for last 1000 iterations of $N$ (left) and $\alpha$ (right). Bottom row: autocorrelation plots after burn-in
for $N$ (left) and $\alpha$ (right).

To implement the hybrid Gibbs sampler for Example 7.7, a Metropolis-Hastings
step is required to sample $(\theta_1, \theta_2)$ in (7.18). Note that the prior distribution for these
parameters restricts $(\theta_1, \theta_2)$ to be larger than 0. Such a constraint can impede MCMC
performance, particularly if there is high posterior density near the boundary.
Therefore we consider using a random walk to update these parameters, but to improve
performance we transform $(\theta_1, \theta_2)$ to $U = (U_1, U_2) = (\log \theta_1, \log \theta_2)$. This permits
a random walk step on $(-\infty, \infty)$ to update $U$ effectively. Specifically, proposal
values $U^*$ can be generated by drawing $\epsilon \sim N(0, 0.085^2 I)$, where $I$ is the $2 \times 2$ identity
matrix, and then setting $U^* = u^{(t)} + \epsilon$. We select a standard deviation of 0.085 to get
an acceptance rate of about 23% for the $U$ updates. Recalling (7.8) in Example 7.3,
it is necessary to transform (7.17) and (7.18) to reflect the change of variables. Thus,
(7.17) becomes


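$$\alpha_i \mid \cdot \sim \text{Beta}\left(c_i + \exp\{u_1\},\; N - c_i + \exp\{u_2\}\right) \quad \text{for } i = 1, \ldots, 7, \tag{7.26}$$

and (7.18) becomes

$$U_1, U_2 \mid \cdot \sim k_u \exp\{u_1 + u_2\} \left[\frac{\Gamma(\exp\{u_1\} + \exp\{u_2\})}{\Gamma(\exp\{u_1\})\,\Gamma(\exp\{u_2\})}\right]^{7} \prod_{i=1}^{7} \alpha_i^{\exp\{u_1\}} (1 - \alpha_i)^{\exp\{u_2\}} \exp\left\{-\frac{\exp\{u_1\} + \exp\{u_2\}}{1000}\right\}, \tag{7.27}$$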


where $k_u$ is an unknown constant. This method of transforming the parameter space via
a change of variables within the Metropolis-Hastings algorithm is useful for
problems with constrained parameter spaces. The idea is to transform the constrained
parameters so that the MCMC updates can be made on an unconstrained real space $\Re^k$.
See [329] for a more complex example.
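
The change-of-variables idea can be illustrated on a one-dimensional toy problem. The sketch below is our own example, not the seal pup model: the target is an assumed Gamma(3, rate 2) density on $\theta > 0$, the random walk is run on $u = \log\theta$, and the extra `+ u` term is the log Jacobian that plays the role of the factor $\exp\{u_1 + u_2\}$ in (7.27). The function names and the step size are hypothetical choices.

```python
import math
import random


def log_target_theta(theta):
    """Unnormalized log density of an illustrative Gamma(3, rate=2) target."""
    return 2.0 * math.log(theta) - 2.0 * theta


def log_target_u(u):
    """Log density of U = log(theta): target at exp(u) plus the log Jacobian u."""
    return log_target_theta(math.exp(u)) + u


def rw_metropolis_log_scale(n_iter, step_sd, u0=0.0, seed=0):
    """Random walk Metropolis on u; returns draws of theta and the acceptance rate."""
    rng = random.Random(seed)
    u = u0
    thetas = []
    accepted = 0
    for _ in range(n_iter):
        u_star = u + rng.gauss(0.0, step_sd)  # symmetric proposal on the real line
        log_ratio = log_target_u(u_star) - log_target_u(u)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            u = u_star
            accepted += 1
        thetas.append(math.exp(u))  # back-transform; always positive
    return thetas, accepted / n_iter


thetas, acc_rate = rw_metropolis_log_scale(n_iter=20000, step_sd=1.0)
post_mean = sum(thetas[2000:]) / len(thetas[2000:])
```

Because the proposal is symmetric in $u$, no proposal ratio is needed; omitting the Jacobian term, however, would make the chain target the wrong distribution.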

We implement the hybrid sampler running a chain of 100,000 iterations with
the first 50,000 iterations discarded for burn-in. The Gelman-Rubin statistics for the
parameters are all very close to 1; however, the autocorrelation plots indicate high
correlation between iterations (left panel in Figure 7.10). For example, using the
alternative prior distributions and the hybrid Gibbs sampler produces correlations of
0.6 between MCMC samples that are 40 iterations apart, as compared to correlations
near 0 at lags of 2 when using the uniform/Jeffreys prior combination and basic Gibbs
sampler shown in Figure 7.9. Similarly, the effective sample size for the hybrid Gibbs
algorithm of 1127 can be compared to the much larger effective sample size of 45,206
in the basic Gibbs sampler discussed above. The right panel of Figure 7.10 shows
the bivariate sample path of $U_1^{(t)}$ and $U_2^{(t)}$ for the hybrid sampler. This plot indicates
a high correlation between the parameters. These results suggest a lack of convergence
of the MCMC algorithm or at least poor behavior of the chain for the hybrid
algorithm. In spite of these indications of lack of convergence, the prior distributions
from Example 7.7 produce very similar results to those from the uniform/Jeffreys
prior combination for Example 7.6. The posterior mean of $N$ is 90 with a 95% HPD
interval of (85, 95). However, the chain does not mix as well as for the simpler model,
so we prefer the uniform/Jeffreys prior for the seal pup data. A hybrid Gibbs sampler
can be quite effective for many problems, but for these data the alternative prior
described in Example 7.7 is not appropriate and the hybrid algorithm does not remedy
the problem.

FIGURE 7.10 For the hybrid Gibbs sampler from Example 7.7: Autocorrelation function
plot for $p_1$ (left panel) and sample path for $U$ (right panel) for final 5000 iterations in the seal
pup example.
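
To make the effective sample size comparison concrete, the following sketch computes one common estimator, $L / (1 + 2\sum_k \hat\rho(k))$, with the sum truncated at the first lag whose estimated autocorrelation falls below 0.1. The truncation rule, the function names, and the two synthetic chains are assumptions made for illustration; they are not the seal pup output.

```python
import random


def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(x, cutoff=0.1, max_lag=1000):
    """len(x) divided by the estimated autocorrelation time."""
    tau = 1.0
    for k in range(1, min(max_lag, len(x) - 1) + 1):
        rho = autocorr(x, k)
        if rho < cutoff:  # stop summing once the correlation is negligible
            break
        tau += 2.0 * rho
    return len(x) / tau


rng = random.Random(0)
iid = [rng.gauss(0.0, 1.0) for _ in range(10000)]  # well-mixing chain
ar1 = [0.0]
for _ in range(9999):  # slowly mixing AR(1) chain with lag-1 correlation 0.9
    ar1.append(0.9 * ar1[-1] + rng.gauss(0.0, 1.0))

ess_iid = effective_sample_size(iid)
ess_ar1 = effective_sample_size(ar1)
```

The strongly autocorrelated chain yields an effective sample size that is a small fraction of its length, mirroring the contrast between the two samplers discussed above.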


PROBLEMS 


7.1. The goal of this problem is to investigate the role of the proposal distribution in a
Metropolis-Hastings algorithm designed to simulate from the posterior distribution of
a parameter $\delta$. In part (a), you are asked to simulate data from a distribution with $\delta$
known. For parts (b)-(d), assume $\delta$ is unknown with a Unif(0,1) prior distribution for
$\delta$. For parts (b)-(d), provide an appropriate plot and a table summarizing the output of
the algorithm. To facilitate comparisons, use the same number of iterations, random
seed, starting values, and burn-in period for all implementations of the algorithm.


a. Simulate 200 realizations from the mixture distribution in Equation (7.6) with $\delta = 0.7$.
Draw a histogram of these data.


b. Implement an independence chain MCMC procedure to simulate from the posterior
distribution of $\delta$, using your data from part (a).


c. Implement a random walk chain with $\delta^* = \delta^{(t)} + \epsilon$ with $\epsilon \sim \text{Unif}(-1, 1)$.


d. Reparameterize the problem letting $U = \log\{\delta/(1 - \delta)\}$ and $U^* = u^{(t)} + \epsilon$.
Implement a random walk chain in $U$-space as in Equation (7.8).


e. Compare the estimates and convergence behavior of the three algorithms.


7.2. Simulating from the mixture distribution in Equation (7.6) is straightforward [see
part (a) of Problem 7.1]. However, using the Metropolis-Hastings algorithm to
simulate realizations from this distribution is useful for exploring the role of the proposal
distribution.


a. Implement a Metropolis-Hastings algorithm to simulate from Equation (7.6) with
$\delta = 0.7$, using $N(x^{(t)}, 0.01^2)$ as the proposal distribution. For each of three starting
values, $x^{(0)} = 0$, 7, and 15, run the chain for 10,000 iterations. Plot the sample path
of the output from each chain. If only one of the sample paths was available, what
would you conclude about the chain? For each of the simulations, create a histogram
of the realizations with the true density superimposed on the histogram. Based on
your output from all three chains, what can you say about the behavior of the chain?


b. Now change the proposal distribution to improve the convergence properties of the
chain. Using the new proposal distribution, repeat part (a).


7.3. Consider a disk $D$ of radius 1 inscribed within a square of perimeter 8 centered at the
origin. Then the ratio of the area of the disk to that of the square is $\pi/4$. Let $f$ represent
the uniform distribution on the square. Then for a sample of points $(X_i, Y_i) \sim f(x, y)$
for $i = 1, \ldots, n$, $\hat{\pi} = (4/n) \sum_{i=1}^{n} 1_{\{(X_i, Y_i) \in D\}}$ is an estimator of $\pi$ (where $1_{\{A\}}$ is 1 if $A$
is true and 0 otherwise).

Consider the following strategy for estimating $\pi$. Start with $(x^{(0)}, y^{(0)}) = (0, 0)$.
Thereafter, generate candidates as follows. First, generate $\epsilon_x^{(t)} \sim \text{Unif}(-h, h)$ and
$\epsilon_y^{(t)} \sim \text{Unif}(-h, h)$. If $(x^{(t)} + \epsilon_x^{(t)}, y^{(t)} + \epsilon_y^{(t)})$ falls outside the square, regenerate $\epsilon_x^{(t)}$ and
$\epsilon_y^{(t)}$ until the step taken remains within the square. Let $(X^{(t+1)}, Y^{(t+1)}) = (x^{(t)} + \epsilon_x^{(t)}, y^{(t)} + \epsilon_y^{(t)})$.
Increment $t$. This generates a sample of points over the square. When $t = n$, stop
and calculate $\hat{\pi}$ as given above.


a. Implement this method for $h = 1$ and $n = 20{,}000$. Compute $\hat{\pi}$. What is the effect
of increasing $n$? What is the effect of increasing and decreasing $h$? Comment.


b. Explain why this method is flawed. Using the same method to generate candidates,
develop the correct approach by referring to the Metropolis-Hastings ratio. Prove
that your sampling approach has a stationary distribution that is uniform on the
square.


c. Implement your approach from part (b) and calculate $\hat{\pi}$. Experiment again with $n$
and $h$. Comment.


7.4. Derive the conditional distributions in Equation (7.10) and the univariate conditional
distributions below Equation (7.10).


7.5. A clinical trial was conducted to determine whether a hormone treatment benefits
women who were treated previously for breast cancer. Each subject entered the clinical
trial when she had a recurrence. She was then treated by irradiation and assigned to
either a hormone therapy group or a control group. The observation of interest is the time
until a second recurrence, which may be assumed to follow an exponential distribution
with parameter $\tau\theta$ (hormone therapy group) or $\theta$ (control group). Many of the women
did not have a second recurrence before the clinical trial was concluded, so that their
recurrence times are censored.

TABLE 7.2 Breast cancer data.

                     Hormone Treated        Control

Recurrence Times      2  4  6  9  9  9       1  4  6  7 13 24
                     13 14 18 23 31 32      25 35 35 39
                     33 34 43

Censoring Times      10 14 14 16 17 18       1  1  3  4  5  8
                     18 19 20 20 21 21      10 11 13 14 14 15
                     23 24 29 29 30 30      17 19 20 22 24 24
                     31 31 31 33 35 37      24 25 26 26 26 28
                     40 41 42 42 44 46      29 29 32 35 38 39
                     48 49 51 53 54 54      40 41 44 45 47 47
                     55 56                  47 50 50 51

In Table 7.2, a censoring time M means that a woman was observed for M 
months and did not have a recurrence during that time period, so that her recurrence 
time is known to exceed M months. For example, 15 women who received the hormone 
treatment suffered recurrences, and the total of their recurrence times is 280 months. 

Let $y_i^H = (x_i^H, \delta_i^H)$ be the data for the $i$th person in the hormone group, where
$x_i^H$ is the time and $\delta_i^H$ equals 1 if $x_i^H$ is a recurrence time and 0 if a censored time. The
data for the control group can be written similarly.

The likelihood is then

$$L(\theta, \tau \mid y) \propto \theta^{\sum_i \delta_i^C + \sum_i \delta_i^H}\, \tau^{\sum_i \delta_i^H} \exp\left\{-\theta \sum_i x_i^C - \tau\theta \sum_i x_i^H\right\}.$$

You’ve been hired by the drug company to analyze their data. They want to know
if the hormone treatment works, so the task is to find the marginal posterior distribution
of $\tau$ using the Gibbs sampler. In a Bayesian analysis of these data, use the conjugate
prior

$$f(\theta, \tau) \propto \theta^{a} \tau^{b} \exp\{-c\theta - d\theta\tau\}.$$


Physicians who have worked extensively with this hormone treatment have indicated 
that reasonable values for the hyperparameters are (a, b, c, d) = (3, 1, 60, 120). 

a. Summarize and plot the data as appropriate. 

b. Derive the conditional distributions necessary to implement the Gibbs sampler. 


c. Program and run your Gibbs sampler. Use a suite of convergence diagnostics to 
evaluate the convergence and mixing of your sampler. Interpret the diagnostics. 


d. Compute summary statistics of the estimated joint posterior distribution, includ- 
ing marginal means, standard deviations, and 95% probability intervals for each 
parameter. Make a table of these results. 


e. Create a graph which shows the prior and estimated posterior distribution for $\tau$
superimposed on the same scale.


f. Interpret your results for the drug company. Specifically, what does your estimate
of $\tau$ mean for the clinical trial? Are the recurrence times for the hormone group
significantly different from those for the control group?


g. A common criticism of Bayesian analyses is that the results are highly dependent
on the priors. Investigate this issue by repeating your Gibbs sampler for values of
the hyperparameters that are half and double the original hyperparameter values.
Provide a table of summary statistics to compare your results. This is called a
sensitivity analysis. Based on your results, what recommendations do you have
for the drug company regarding the sensitivity of your results to hyperparameter
values?


7.6. Problem 6.4 introduces data on coal-mining disasters from 1851 to 1962. For these
data, assume the model

$$X_j \sim \begin{cases} \text{Poisson}(\lambda_1), & j = 1, \ldots, \theta, \\ \text{Poisson}(\lambda_2), & j = \theta + 1, \ldots, 112. \end{cases} \tag{7.28}$$

Assume $\lambda_i \mid \alpha \sim \text{Gamma}(3, \alpha)$ for $i = 1, 2$, where $\alpha \sim \text{Gamma}(10, 10)$, and assume $\theta$
follows a discrete uniform distribution over $\{1, \ldots, 111\}$. The goal of this problem is
to estimate the posterior distribution of the model parameters via a Gibbs sampler.


a. Derive the conditional distributions necessary to carry out Gibbs sampling for the
change-point model.


b. Implement the Gibbs sampler. Use a suite of convergence diagnostics to evaluate
the convergence and mixing of your sampler.


c. Construct density histograms and a table of summary statistics for the approximate
posterior distributions of $\theta$, $\lambda_1$, and $\lambda_2$. Are symmetric HPD intervals appropriate
for all of these parameters?


d. Interpret the results in the context of the problem.


7.7. Consider a hierarchical nested model
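$$Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}, \tag{7.29}$$

where $i = 1, \ldots, I$, $j = 1, \ldots, J_i$, and $k = 1, \ldots, K$. After averaging over $k$ for each
$i$ and $j$, we can rewrite the model (7.29) as

$$Y_{ij} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ij}, \quad i = 1, \ldots, I, \; j = 1, \ldots, J_i, \tag{7.30}$$

where $Y_{ij} = \sum_{k=1}^{K} Y_{ijk}/K$. Assume that $\alpha_i \sim N(0, \sigma_\alpha^2)$, $\beta_{j(i)} \sim N(0, \sigma_\beta^2)$, and
$\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$, where each set of parameters is independent a priori. Assume that
$\sigma_\alpha^2$, $\sigma_\beta^2$, and $\sigma_\epsilon^2$ are known. To carry out Bayesian inference for this model, assume an
improper flat prior for $\mu$, so $f(\mu) \propto 1$. We consider two forms of the Gibbs sampler for
this problem [546]: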


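a. Let $n = \sum_i J_i$, $\bar{y}_{\cdot\cdot} = \sum_{i,j} y_{ij}/n$, and $\bar{y}_{i\cdot} = \sum_j y_{ij}/J_i$ hereafter. Show that at
iteration $t$, the conditional distributions necessary to carry out Gibbs sampling for this
model are given by

$$\mu^{(t+1)} \mid (\alpha^{(t)}, \beta^{(t)}, y) \sim N\left(\bar{y}_{\cdot\cdot} - \frac{1}{n}\sum_i J_i \alpha_i^{(t)} - \frac{1}{n}\sum_{j(i)} \beta_{j(i)}^{(t)},\; \frac{\sigma_\epsilon^2}{n}\right),$$

$$\alpha_i^{(t+1)} \mid (\mu^{(t+1)}, \beta^{(t)}, y) \sim N\left(\frac{J_i V_1}{\sigma_\epsilon^2}\left(\bar{y}_{i\cdot} - \mu^{(t+1)} - \frac{1}{J_i}\sum_j \beta_{j(i)}^{(t)}\right),\; V_1\right),$$

$$\beta_{j(i)}^{(t+1)} \mid (\mu^{(t+1)}, \alpha^{(t+1)}, y) \sim N\left(\frac{V_2}{\sigma_\epsilon^2}\left(y_{ij} - \mu^{(t+1)} - \alpha_i^{(t+1)}\right),\; V_2\right),$$

where

$$V_1 = \left(\frac{J_i}{\sigma_\epsilon^2} + \frac{1}{\sigma_\alpha^2}\right)^{-1} \quad \text{and} \quad V_2 = \left(\frac{1}{\sigma_\epsilon^2} + \frac{1}{\sigma_\beta^2}\right)^{-1}.$$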


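b. The convergence rate for a Gibbs sampler can sometimes be improved via
reparameterization. For this model, the model can be reparameterized via hierarchical
centering (Section 7.3.1.4). For example, let $Y_{ij}$ follow (7.30), but now let
$\eta_{ij} = \mu + \alpha_i + \beta_{j(i)}$ and $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$. Then let $\gamma_i = \mu + \alpha_i$ with $\eta_{ij} \mid \gamma_i \sim N(\gamma_i, \sigma_\beta^2)$
and $\gamma_i \mid \mu \sim N(\mu, \sigma_\alpha^2)$. As above, assume $\sigma_\alpha^2$, $\sigma_\beta^2$, and $\sigma_\epsilon^2$ are known, and assume a
flat prior for $\mu$. Show that the conditional distributions necessary to carry out Gibbs
sampling for this model are given by

$$\mu^{(t+1)} \mid (\gamma^{(t)}, \eta^{(t)}, y) \sim N\left(\frac{1}{I}\sum_i \gamma_i^{(t)},\; \frac{\sigma_\alpha^2}{I}\right),$$

$$\gamma_i^{(t+1)} \mid (\mu^{(t+1)}, \eta^{(t)}, y) \sim N\left(V_3\left(\frac{1}{\sigma_\beta^2}\sum_j \eta_{ij}^{(t)} + \frac{\mu^{(t+1)}}{\sigma_\alpha^2}\right),\; V_3\right),$$

$$\eta_{ij}^{(t+1)} \mid (\mu^{(t+1)}, \gamma^{(t+1)}, y) \sim N\left(V_2\left(\frac{y_{ij}}{\sigma_\epsilon^2} + \frac{\gamma_i^{(t+1)}}{\sigma_\beta^2}\right),\; V_2\right),$$

where

$$V_3 = \left(\frac{J_i}{\sigma_\beta^2} + \frac{1}{\sigma_\alpha^2}\right)^{-1}$$

and $V_2 = \left(1/\sigma_\epsilon^2 + 1/\sigma_\beta^2\right)^{-1}$ as in part (a).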

7.8. In Problem 7.7, you were asked to derive Gibbs samplers under two model
parameterizations. The goal of this problem is to compare the performance of the samplers.

The website for this book provides a dataset on the moisture content in the
manufacture of pigment paste [58]. Batches of the pigment were produced, and the
moisture content of each batch was tested analytically. Consider data from 15 randomly
selected batches of pigment. For each batch, two independent samples were randomly
selected and each of these samples was measured twice. For the analyses below, let
$\sigma_\alpha^2 = 86$, $\sigma_\beta^2 = 58$, and $\sigma_\epsilon^2 = 1$.

Implement the two Gibbs samplers described below. To facilitate comparisons
between the samplers, use the same number of iterations, random seed, starting values,
and burn-in period for both implementations.

a. Analyze these data by applying the Gibbs sampler from part (a) of Problem 7.7.
Implement the sampler in blocks. For example, $\alpha = (\alpha_1, \ldots, \alpha_{15})$ is one block where
all parameters can be updated simultaneously because their conditional distributions
are independent. Update the blocks using a deterministic order within each cycle.
For example, generate $\mu^{(0)}$, $\alpha^{(0)}$, $\beta^{(0)}$ in sequence, followed by $\mu^{(1)}$, $\alpha^{(1)}$, $\beta^{(1)}$, and
so on.


b. Analyze these data by applying the Gibbs sampler from part (b) of Problem 7.7.
Implement the sampler and update the blocks using a deterministic order within
each cycle, updating $\mu^{(0)}$, $\gamma^{(0)}$, $\eta^{(0)}$ in sequence, followed by $\mu^{(1)}$, $\gamma^{(1)}$, $\eta^{(1)}$, and
so on.


c. Compare performance of the two algorithms by constructing the following diag- 
nostics for each of the above implementations. 


i. After deleting the burn-in iterations, compute the pairwise correlations between 
all parameters. 


ii. Select several of the parameters in each implementation, and construct an 
autocorrelation plot for each parameter. 


iii. Compare the effective sample size for several parameters for the two imple- 
mentations of the algorithm. 


You may also wish to explore other diagnostics to facilitate these comparisons. For 
this problem, do you recommend the standard or the reparameterized model? 


7.9. Example 7.10 describes the random walk implementation of the hybrid Gibbs sam- 
pler for the fur seal pup capture-recapture study. Derive the conditional distributions 
required for the Gibbs sampler, Equations (7.26) and (7.27). 


CHAPTER 8 


ADVANCED TOPICS IN MCMC 


The theory and practice of Markov chain Monte Carlo continue to advance at a
rapid pace. Two particularly notable innovations are the dimension shifting reversible 
jump MCMC method and approaches for adapting proposal distributions while the 
algorithm is running. Also, applications for Bayesian inference continue to be of 
broad interest. In this chapter we survey a variety of higher level MCMC methods 
and explore some of the possible uses of MCMC to solve challenging statistical 
problems. 

Sections 8.1-8.5 introduce a wide variety of advanced MCMC topics, including
adaptive, reversible jump, and auxiliary variable MCMC, additional
Metropolis-Hastings methods, and perfect sampling methods. In Section 8.6 we discuss an
application of MCMC to maximum likelihood estimation. We conclude the chapter in
Section 8.7 with an example where several of these methods are applied to facilitate 
Bayesian inference for spatial or image data. 


8.1 ADAPTIVE MCMC 


One challenge with MCMC algorithms is that they often require tuning to improve 
convergence behavior. For example, in a Metropolis-Hastings algorithm with a normally
distributed proposal distribution, some trial and error is usually required to tune
the variance of the proposal distribution to achieve an optimal acceptance rate (see 
Section 7.3.1.3). Tuning the proposal distribution becomes even more challenging 
when the number of parameters is large. Adaptive MCMC (AMCMC) algorithms 
allow for automatic tuning of the proposal distribution as iterations progress. 
Markov chain Monte Carlo algorithms that are adaptive have been considered 
for some time but formidable theory was typically required to prove stationarity of 
the resulting Markov chains. More recently, the development of simplified criteria for 
confirming theoretical convergence of proposed algorithms has led to an explosion of 
new adaptive MCMC algorithms [12, 16, 550]. Before describing these algorithms it 
is imperative to stress that care must be taken when developing and applying adaptive 
algorithms to ensure that the chain produced by the algorithm has the correct stationary 
distribution. Without such care, the adaptive algorithm will not produce a Markov 
chain because the entire path up to present time will be required to determine the 
present state. Another risk of adaptive algorithms is that they may depend too heavily
on previous iterations, thus impeding the algorithm from fully exploring the state
space. The best adaptive MCMC algorithms solve these problems by progressively 
reducing the amount of tuning as the number of iterations increases. 

An MCMC algorithm with adaptive proposals is ergodic with respect to the 
target stationary distribution if it satisfies two conditions: diminishing adaptation and 
bounded convergence. Informally, diminishing (or vanishing) adaptation says that
as $t \to \infty$, the parameters in the proposal distribution will depend less and less on
earlier states of the chain. The diminishing adaptation condition can be met either
by modifying the parameters in the proposal distribution by smaller amounts or by
making the adaptations less frequently as $t$ increases. The bounded convergence
(containment) condition considers the time until near convergence. Let $D^{(t)}$ denote the
total variation distance between the stationary distribution of the transition kernel
employed by the AMCMC algorithm at time $t$ and the target stationary distribution.
(The total variation distance can be informally described as the largest possible
difference between the probabilities that two distributions assign to the same event.)
Let $M^{(t)}(\epsilon)$ be the smallest $t$ such that $D^{(t)} < \epsilon$. The bounded convergence condition
states that the stochastic process $M^{(t)}(\epsilon)$ is bounded in probability for any $\epsilon > 0$. The
technical specifications of the
diminishing adaptation and the bounded convergence conditions are beyond the scope 
of this book; see [550] for further discussion. However, in practice these conditions 
lead to simpler, verifiable conditions that are sufficient to guarantee ergodicity of the 
resulting chain with respect to the target stationary distribution and are easier to check. 
We describe these conditions for applications of specific AMCMC algorithms in the 
sections below. 


8.1.1 Adaptive Random Walk Metropolis-within-Gibbs 
Algorithm 


The method discussed in this section is a special case of the algorithm in Section 8.1.3,
but we prefer to begin here at a simpler level. Consider a Gibbs sampler where the
univariate conditional density for the $i$th element of $X = (X_1, \ldots, X_p)$ is not available
in closed form. In this case, we might use a random walk Metropolis algorithm
to simulate draws from the $i$th univariate conditional density (Section 7.1.2). The
goal of the AMCMC algorithm is to tune the variance of the proposal distribution
so that the acceptance rate is optimal (i.e., the variance is neither too large nor too
small). While many variants of the adaptive Metropolis-within-Gibbs algorithm are
possible, we first consider an adaptive normal random walk Metropolis-Hastings
algorithm [551].

In the algorithm below, the adaptation step is performed only at specific times,
for example, iterations $t \in \{50, 100, 150, \ldots\}$. We denote these as batch times $T_b$,
where $b = 0, 1, \ldots$; for example, the proposal variance is first tuned at iteration $T_1 = 50$
and next at iteration $T_2 = 100$. The proposal distribution variance $\sigma_b^2$ will be changed at
these times. Performing the adaptation step every 50 iterations is a common choice;
other updating intervals are reasonable depending on the total number of MCMC 
iterations for a particular problem and the mixing performance of the chain. 

We present the adaptive random walk Metropolis-within-Gibbs algorithm as if 
the parameters were arranged so that the univariate conditional density for the first
element of X is not available in closed form. An adaptive Metropolis-within-Gibbs
update is used for this element. We assume that the univariate conditional densities 
for the remaining elements of X permit standard Gibbs updates. The adaptive random 
walk Metropolis-within-Gibbs proceeds as follows. 


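1. Initialization: Select starting values $X^{(0)} = x^{(0)}$ and set $t = 0$. Select a batching
schedule $\{T_b\}$ for $b = 0, 1, 2, \ldots$ and set a batch index $b = 0$. Let $\sigma_0^2 = 1$.


2. Metropolis-within-Gibbs update: Update $X_1^{(t)}$ using a random walk update using
the following steps:

a. Generate $X_1^*$ by drawing $\epsilon \sim N(0, \sigma_b^2)$ and then set $X_1^* = x_1^{(t)} + \epsilon$.

b. Compute the Metropolis-Hastings ratio

$$R\left(x_1^{(t)}, X_1^*\right) = \frac{f(X_1^*)}{f(x_1^{(t)})}. \tag{8.1}$$

c. Sample a value for $X_1^{(t+1)}$ according to the following:

$$X_1^{(t+1)} = \begin{cases} X_1^* & \text{with probability } \min\left\{R\left(x_1^{(t)}, X_1^*\right), 1\right\}, \\ x_1^{(t)} & \text{otherwise.} \end{cases}$$


3. Gibbs updates: Since closed-form univariate conditional densities are available
for $i = 2, \ldots, p$, use Gibbs updates as follows:
Generate, in turn,

$$X_2^{(t+1)} \mid \cdot \sim f\left(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_p^{(t)}\right),$$

$$X_3^{(t+1)} \mid \cdot \sim f\left(x_3 \mid x_1^{(t+1)}, x_2^{(t+1)}, x_4^{(t)}, \ldots, x_p^{(t)}\right),$$

$$\vdots$$

$$X_{p-1}^{(t+1)} \mid \cdot \sim f\left(x_{p-1} \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{p-2}^{(t+1)}, x_p^{(t)}\right),$$

$$X_p^{(t+1)} \mid \cdot \sim f\left(x_p \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{p-1}^{(t+1)}\right),$$

where $\mid \cdot$ denotes conditioning on the most recent updates to all other elements
of $X$.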


4. Adaptation step: when $t = T_{b+1}$,


a. Update the variance of the proposal distribution

$$\log(\sigma_{b+1}) = \log(\sigma_b) \pm \delta(b+1),$$

where the adaptation factor $\delta(b+1)$ is subtracted when the Metropolis-Hastings
acceptance rate in step 2(c) is smaller than 0.44 for the iterations in the previous
batch (too many rejections indicate that the proposed steps are too large) and
added otherwise. A common choice for the adaptation
factor is $\delta(b+1) = \min(0.01, 1/\sqrt{T_b})$, where 0.01 is an arbitrary constant
that initially limits the magnitude of adaptation.


b. Increment the batch index $b = b + 1$.

5. Increment $t$ and return to step 2.


In the adaptation step, the variance of the proposal distribution is usually tuned
with the goal of obtaining a proposal acceptance rate of about 0.44 (so, 44% of
proposals are accepted) when $X_1$ is univariate. This rate has been shown to be optimal for
univariate normally distributed target and proposal distributions (see Section 7.3.1.3).

As for any AMCMC implementation, we need to check the convergence criteria.
For the Metropolis-within-Gibbs algorithm, the diminishing adaptation condition
is satisfied when $\delta(b) \to 0$ as $b \to \infty$. The bounded convergence condition is
satisfied if $\log(\sigma_b) \in [-M, M]$ where $M < \infty$ is some finite bound. Less stringent (but
perhaps less intuitive) requirements also satisfy the bounded convergence condition;
see [554].
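
The algorithm and its two convergence safeguards can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the bivariate target (chosen so that $X_2 \mid x_1$ is normal while $X_1 \mid x_2$ has no standard form), the function names, the batch length of 50, and the bound $M = 10$ are all our own choices. The proposal standard deviation is lowered when the batch acceptance rate falls below 0.44 and raised otherwise.

```python
import math
import random


def log_cond_x1(x1, x2):
    """Unnormalized log conditional density of X1 given X2 (illustrative target)."""
    return -0.25 * x1 ** 4 - 0.5 * x1 ** 2 * x2 ** 2


def adaptive_mwg(n_iter=20000, batch=50, target_rate=0.44, bound=10.0, seed=0):
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    log_sd = 0.0  # log(sigma_0) with sigma_0 = 1
    accepted = 0
    draws, batch_rates = [], []
    for t in range(1, n_iter + 1):
        # Step 2: random walk Metropolis update for X1.
        x1_star = x1 + rng.gauss(0.0, math.exp(log_sd))
        log_r = log_cond_x1(x1_star, x2) - log_cond_x1(x1, x2)
        if rng.random() < math.exp(min(0.0, log_r)):
            x1 = x1_star
            accepted += 1
        # Step 3: Gibbs update, since X2 | x1 ~ N(0, 1 / (1 + x1^2)) here.
        x2 = rng.gauss(0.0, 1.0 / math.sqrt(1.0 + x1 ** 2))
        draws.append((x1, x2))
        # Step 4: adapt at the end of each batch.
        if t % batch == 0:
            rate = accepted / batch
            t_prev = t - batch  # previous batch time T_b
            delta = 0.01 if t_prev == 0 else min(0.01, 1.0 / math.sqrt(t_prev))
            log_sd += delta if rate > target_rate else -delta
            # Keep log(sigma_b) in [-M, M] for the bounded convergence condition.
            log_sd = max(-bound, min(bound, log_sd))
            batch_rates.append(rate)
            accepted = 0
    return draws, math.exp(log_sd), batch_rates


draws, final_sd, batch_rates = adaptive_mwg()
recent_rate = sum(batch_rates[-100:]) / 100
```

Because $\delta(b)$ shrinks once $1/\sqrt{T_b}$ drops below 0.01, the adjustments diminish over time, which is the practical form of the diminishing adaptation condition.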

The adaptive algorithm above can be generalized to other random walk
distributions. Generally, in step 2(a), generate $X_1^*$ by drawing $\epsilon \sim h(\epsilon)$ for some density $h$.
With this change, note that in step 2(b), the Metropolis-Hastings ratio will need to
be adapted to include the proposal distribution as in (7.1) if $h$ is not symmetric.

The adaptive Metropolis-within-Gibbs algorithm is particularly useful when 
there are many parameters, each with its own variance to be tuned. For example, 
AMCMC methods have been used successfully for genetic data where the number 
of parameters can grow very quickly [637]. In that case, the algorithm above will 
need to be modified so that each element of X will have its own adaptation step and 
adaptive variance. We demonstrate a similar situation in Example 8.1. Alternatively, 
the proposal variance can be adapted jointly by accounting for the var{X}. This is 
discussed further in Section 8.1.3. 


8.1.2 General Adaptive Metropolis-within-Gibbs Algorithm 


The adaptive random walk Metropolis-within-Gibbs algorithm is a modification of 
the random walk algorithm (Section 7.1.2). Other adaptive forms of the Metropolis- 
within-Gibbs algorithm can be applied as long as the diminishing adaptation and the 
bounded convergence conditions are met. In the example below we develop one such 
algorithm for a realistic example. 


Example 8.1 (Whale Population Dynamics) Population dynamics models de- 
scribe changes in animal abundance over time. Natural mortality, reproduction, and 
human-based removals (e.g., catch) usually drive abundance trends. Another impor- 
tant concept in many such models is carrying capacity, which represents the num- 
ber of animals that can be sustained in equilibrium with the amount of resources 
available within the limited range inhabited by the population. As animal abundance 
increases toward (and potentially beyond) carrying capacity, there is greater compe- 
tition for limited resources, which reduces net population growth or even reverses 
it when abundance exceeds carrying capacity. This dependence of the population
growth rate on how near the current abundance is to carrying capacity is called density
dependence. 
A simple discrete-time density-dependent population dynamics model is

$$N_{y+1} = N_y - C_y + r N_y \left(1 - \left(\frac{N_y}{K}\right)^2\right) \tag{8.2}$$

where $N_y$ and $C_y$, respectively, represent abundance and catch in year $y$, $r$ denotes
the intrinsic growth rate, which encompasses both productivity and natural mortality,
and $K$ represents carrying capacity. This model is known as the Pella-Tomlinson
model [504].
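
The recursion (8.2) is easy to simulate. The sketch below is illustrative only: the function name `pella_tomlinson`, the parameter values, and the catch series (heavy early catches followed by small ones) are hypothetical and are not the data from the book's website. It starts the population at $N_0 = K$.

```python
def pella_tomlinson(K, r, catches):
    """Abundance trajectory N_0, N_1, ... under (8.2), starting from N_0 = K."""
    abundance = [float(K)]
    for catch in catches:
        n = abundance[-1]
        abundance.append(n - catch + r * n * (1.0 - (n / K) ** 2))
    return abundance


# Hypothetical inputs: 15 years of heavy catch, then 85 years of light catch.
K, r = 10000.0, 0.03
catches = [600.0] * 15 + [10.0] * 85
trajectory = pella_tomlinson(K, r, catches)
nadir_year = min(range(len(trajectory)), key=lambda y: trajectory[y])
```

With no catch the population stays at $K$, since the density-dependent growth term vanishes when $N_y = K$; with the catch series above the trajectory declines steeply and then recovers slowly.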

In application, this model should not be taken too literally. Abundances may 
be rounded to integers, but allowing fractional animals is also reasonable. Carrying 
capacity can be considered an abstraction, allowing the model to exhibit density- 
dependent dynamics rather than imposing an abrupt and absolute ceiling for abun- 
dance or allowing unlimited growth. Our implementation of (8.2) assumes that 
abundance is measured on the first day of the year and whales are harvested on the 
last day. 

Consider estimating the parameters in model (8.2) when one knows $C_y$ in every
year and has observed estimates of $N_y$ for at least some years over the modeling period.
When the population is believed to have been in equilibrium before the modeling
period, it is natural to assume $N_0 = K$, and we do so hereafter. In this case, the model
contains two parameters: $K$ and $r$.

Let the observed estimates of $N_y$ be denoted $\hat{N}_y$. For whales, abundance
surveys are logistically challenging and usually require considerable expenditures of
time, effort, and money, so $\hat{N}_y$ may be obtained only rarely. Thus, in this example
based on artificial data there are only six observed abundance estimates, denoted
$\hat{N} = \{\hat{N}_1, \ldots, \hat{N}_6\}$.

The website for this book provides catch data for 101 years, along with survey
abundance estimates $\hat{N}_y$ for $y \in \{14, 21, 63, 93, 100\}$. Each abundance estimate
includes an estimated coefficient of variation $\hat{\psi}_y$. Conditionally on the $\hat{\psi}_y$, let us assume
that each abundance estimate $\hat{N}_y$ is lognormally distributed as follows:

$$\log\{\hat{N}_y\} \sim N\left(\log\{N_y\}, \sigma_y^2\right) \tag{8.3}$$

where $\sigma_y^2 = \log\{1 + \hat{\psi}_y^2\}$. Figure 8.1 shows the available data and the estimated
population trajectory using the maximum a posteriori estimate of $r$ and $K$ discussed later.
For the purposes of this example, let us assume that $\hat{\psi}_y = \psi$ for all $y$ and incorporate
$\psi$ as a third parameter in the model. The overall likelihood is written $L(K, r, \psi \mid \hat{N})$.
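
A log-likelihood built from (8.2) and (8.3) might look as follows. This is a sketch under assumptions: the function names, the catch series, and the survey values are hypothetical, the survey years are borrowed from the list above purely for illustration, and terms that do not involve $(K, r, \psi)$ are dropped.

```python
import math


def trajectory(K, r, catches):
    """Abundance path under (8.2) with N_0 = K."""
    path = [float(K)]
    for catch in catches:
        n = path[-1]
        path.append(n - catch + r * n * (1.0 - (n / K) ** 2))
    return path


def log_likelihood(K, r, psi, catches, surveys):
    """Log of L(K, r, psi | N-hat); surveys maps year -> observed estimate."""
    path = trajectory(K, r, catches)
    sigma2 = math.log(1.0 + psi ** 2)  # variance on the log scale, as in (8.3)
    total = 0.0
    for year, n_hat in surveys.items():
        n_model = path[year]
        if n_model <= 0.0:  # an extinct trajectory cannot explain the data
            return -math.inf
        resid = math.log(n_hat) - math.log(n_model)
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - 0.5 * resid ** 2 / sigma2
    return total


# Hypothetical catches, and surveys read off a known trajectory for illustration.
catches = [600.0] * 15 + [10.0] * 85
truth = trajectory(10000.0, 0.03, catches)
surveys = {y: truth[y] for y in (14, 21, 63, 93, 100)}
ll_true = log_likelihood(10000.0, 0.03, 0.2, catches, surveys)
ll_off = log_likelihood(10000.0, 0.06, 0.2, catches, surveys)
```

Even in this toy setting the likelihood drops sharply when $r$ is moved away from the value that generated the surveys while $K$ is held fixed, which hints at the narrow ridge in $(K, r)$ described next.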
This setup conceals a serious challenge for analysis that is induced by two 
aspects of the data. First, the catches were huge in a brief early period, causing 
severe population decline, with subsequent small catches allowing substantial re- 
covery. Second, most of the available abundance estimates correspond either to the 
near present or to years that are near the population nadir that occurred many years 
ago. Together, these facts require that any population trajectory reasonably consistent 
with the observed data must “thread the needle” by using parameter values for which 
the trajectory passes through appropriately small abundances in the distant past and
recovers to observed levels in the present. This situation forces a strong nonlinear
dependence between $K$ and $r$: for any $K$ only a very narrow range of $r$ values can
produce acceptable population trajectories, especially when $K$ is at the lower end of
its feasible values.

FIGURE 8.1 Six abundance estimates and the maximum a posteriori population trajectory
estimate for the whale dynamics model in Example 8.1.

We adopt a Bayesian approach for estimating the model parameters using the 
independent priors 


$$K \sim \text{Unif}(7000, 100000),$$
$$r \sim \text{Unif}(0.001, 0.1),$$
$$\psi/2 \sim \text{Beta}(2, 10).$$


These choices are based on research for other whale species and basic biological
limitations such as gestation and reproduction limits. Denote the resultant joint prior
distribution as $p(K, r, \psi)$.
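
For computation it is convenient to work with the log of the joint prior. The sketch below is one way to write it, with a hypothetical function name; the factor $1/2$ in the density of $\psi$ arises because the Beta(2, 10) distribution is placed on $\psi/2$ rather than on $\psi$ itself.

```python
import math


def log_prior(K, r, psi):
    """Log of p(K, r, psi) under the independent priors; -inf outside the support."""
    if not (7000.0 <= K <= 100000.0 and 0.001 <= r <= 0.1 and 0.0 < psi < 2.0):
        return -math.inf
    x = psi / 2.0
    # log of 1 / B(2, 10), the Beta normalizing constant
    log_beta_const = math.lgamma(12.0) - math.lgamma(2.0) - math.lgamma(10.0)
    # Beta(2, 10) density at psi / 2, times the Jacobian 1/2
    log_p_psi = log_beta_const + math.log(x) + 9.0 * math.log1p(-x) - math.log(2.0)
    log_p_K = -math.log(100000.0 - 7000.0)
    log_p_r = -math.log(0.1 - 0.001)
    return log_p_K + log_p_r + log_p_psi


example_value = log_prior(20000.0, 0.05, 0.3)
```

Returning negative infinity outside the support means that any proposal falling outside the prior ranges is rejected automatically in a Metropolis-Hastings update.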

For posterior inference we will use a hybrid Gibbs approach (Section 7.2.5) 
since the univariate conditional distributions are not available in closed form. We 
update each parameter using a Metropolis-Hastings update.

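Let Gibbs cycles be indexed by $t$. The proposals for each parameter at cycle
$t + 1$ are taken to be random Markov steps from their previous values. Specifically,

$$K^* = K^{(t)} + \epsilon_K^{(t+1)},$$
$$r^* = r^{(t)} + \epsilon_r^{(t+1)},$$
$$\psi^* = \psi^{(t)} + \epsilon_\psi^{(t+1)}.$$

The proposal distributions for the parameters, which we will denote $g_K$, $g_r$, and
$g_\psi$, are determined by the conditional distributions of $\epsilon_K^{(t+1)} \mid K^{(t)}$, $\epsilon_r^{(t+1)} \mid r^{(t)}$, and
$\epsilon_\psi^{(t+1)} \mid \psi^{(t)}$. We denote those distributions as $g_{\epsilon_K}$, $g_{\epsilon_r}$, and $g_{\epsilon_\psi}$. For this example,
we use

$$g_{\epsilon_K}\left(\epsilon_K^{(t+1)} \mid K^{(t)}\right) \propto \phi\left(\epsilon_K^{(t+1)}; 0, 200\right) I\left(\epsilon_K^{(t+1)} \in S_K\right), \tag{8.4}$$
$$g_{\epsilon_r}\left(\epsilon_r^{(t+1)} \mid r^{(t)}\right) \propto I\left(\epsilon_r^{(t+1)} \in S_r\right), \tag{8.5}$$
$$g_{\epsilon_\psi}\left(\epsilon_\psi^{(t+1)} \mid \psi^{(t)}\right) \propto \phi\left(\epsilon_\psi^{(t+1)}; 0, 0.1\right) I\left(\epsilon_\psi^{(t+1)} \in S_\psi\right), \tag{8.6}$$

where the support regions of these densities are

$$S_K = \left\{\epsilon_K^{(t+1)} : 7000 \le K^{(t)} + \epsilon_K^{(t+1)} \le 100000\right\},$$

$$S_r = \left\{\epsilon_r^{(t+1)} : \max\{0.001, r^{(t)} - 0.03\} \le r^{(t)} + \epsilon_r^{(t+1)} \le \min\{0.1, r^{(t)} + 0.03\}\right\},$$

$$S_\psi = \left\{\epsilon_\psi^{(t+1)} : 0 < \psi^{(t)} + \epsilon_\psi^{(t+1)} \le 2\right\}.$$

Lastly, $I(z \in Z) = 1$ if $z \in Z$ and zero otherwise, and $\phi(z; a, b)$ represents the normal
distribution density for $Z$ with mean $a$ and standard deviation $b$. Note that $g_{\epsilon_r}$ is simply
a uniform distribution over $S_r$.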

These proposal distributions (8.4)-(8.6) are sufficient to specify $g_K$, $g_r$, and $g_\psi$.
Note that proposals are not symmetric in the sense that the probability density for
proposing $\theta^*$ from $\theta^{(t)}$ is not the same as for proposing $\theta^{(t)}$ from $\theta^*$ for
$\theta \in \{K, r, \psi\}$. This fact holds because in each case the truncation of the distribution
of the proposed increment depends on the previous parameter value. Hence when calculating
transition probabilities one cannot ignore the direction of transition. Moreover,
the setup described above is not a random walk because the Markov increments in
(8.4)-(8.6) are not independent of the parameter values at the previous time.
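
The asymmetry can be checked numerically. The following is a minimal sketch, assuming the truncated normal increment of (8.4) is normalized over its support; the function names and the current value 7100 are hypothetical choices made only for illustration.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def trunc_normal_increment_density(eps, current, sd, lower, upper):
    """Density of an increment eps ~ N(0, sd^2), truncated so that
    lower <= current + eps <= upper, in the spirit of (8.4) and (8.6)."""
    if not lower <= current + eps <= upper:
        return 0.0
    # The normalizing mass depends on the current value through the truncation.
    mass = norm_cdf((upper - current) / sd) - norm_cdf((lower - current) / sd)
    return norm_pdf(eps / sd) / (sd * mass)


K_LOW, K_HIGH, K_SD = 7000.0, 100000.0, 200.0
k_current = 7100.0            # hypothetical current value near the lower bound
eps = 150.0                   # hypothetical increment
k_star = k_current + eps

forward = trunc_normal_increment_density(eps, k_current, K_SD, K_LOW, K_HIGH)
reverse = trunc_normal_increment_density(-eps, k_star, K_SD, K_LOW, K_HIGH)
print(forward, reverse, reverse / forward)
```

Near the boundary the forward and reverse densities differ, so their ratio cannot be dropped from the acceptance calculation.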

Before we consider an adaptive MCMC approach, let us review a standard 
implementation. At iteration t, a nonadaptive MCMC algorithm to sample from the 
posterior is given by: 


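1. Define $g_{\epsilon_K}$, $g_{\epsilon_r}$, and $g_{\epsilon_\psi}$ as (8.4), (8.5), and (8.6). This step, which doesn't
depend on t, is included here nevertheless because when we switch to an adaptive
method these definitions will change at each iteration.

2. Sample from the increment distributions. This requires sampling $\epsilon_K^{(t+1)} \mid K^{(t)}$,
$\epsilon_r^{(t+1)} \mid r^{(t)}$, and $\epsilon_\psi^{(t+1)} \mid \psi^{(t)}$ from the distributions specified in step 1.

3. Generate $K^{(t+1)}$ as follows:
a. Propose $K^* = K^{(t)} + \epsilon_K^{(t+1)}$.
b. Calculate

$$R_K = \frac{L\bigl(K^*, r^{(t)}, \psi^{(t)} \mid \hat{N}\bigr)\, p\bigl(K^*, r^{(t)}, \psi^{(t)}\bigr)\, g_{\epsilon_K}\bigl(-\epsilon_K^{(t+1)} \mid K^*\bigr)}{L\bigl(K^{(t)}, r^{(t)}, \psi^{(t)} \mid \hat{N}\bigr)\, p\bigl(K^{(t)}, r^{(t)}, \psi^{(t)}\bigr)\, g_{\epsilon_K}\bigl(\epsilon_K^{(t+1)} \mid K^{(t)}\bigr)}.$$

Here the density of the reverse increment, $-\epsilon_K^{(t+1)}$ from $K^*$, appears in the numerator
and the density of the forward increment in the denominator, as in the
Metropolis-Hastings ratio (7.1). The same convention applies in steps 4 and 5.
c. Set $K^{(t+1)} = K^*$ with probability equal to $\min\{1, R_K\}$. Otherwise $K^{(t+1)} = K^{(t)}$.

4. Generate $r^{(t+1)}$ as follows:
a. Propose $r^* = r^{(t)} + \epsilon_r^{(t+1)}$.
b. Calculate

$$R_r = \frac{L\bigl(K^{(t+1)}, r^*, \psi^{(t)} \mid \hat{N}\bigr)\, p\bigl(K^{(t+1)}, r^*, \psi^{(t)}\bigr)\, g_{\epsilon_r}\bigl(-\epsilon_r^{(t+1)} \mid r^*\bigr)}{L\bigl(K^{(t+1)}, r^{(t)}, \psi^{(t)} \mid \hat{N}\bigr)\, p\bigl(K^{(t+1)}, r^{(t)}, \psi^{(t)}\bigr)\, g_{\epsilon_r}\bigl(\epsilon_r^{(t+1)} \mid r^{(t)}\bigr)}.$$

c. Set $r^{(t+1)} = r^*$ with probability equal to $\min\{1, R_r\}$. Otherwise $r^{(t+1)} = r^{(t)}$.

5. Generate $\psi^{(t+1)}$ as follows:
a. Propose $\psi^* = \psi^{(t)} + \epsilon_\psi^{(t+1)}$.
b. Calculate

$$R_\psi = \frac{L\bigl(K^{(t+1)}, r^{(t+1)}, \psi^* \mid \hat{N}\bigr)\, p\bigl(K^{(t+1)}, r^{(t+1)}, \psi^*\bigr)\, g_{\epsilon_\psi}\bigl(-\epsilon_\psi^{(t+1)} \mid \psi^*\bigr)}{L\bigl(K^{(t+1)}, r^{(t+1)}, \psi^{(t)} \mid \hat{N}\bigr)\, p\bigl(K^{(t+1)}, r^{(t+1)}, \psi^{(t)}\bigr)\, g_{\epsilon_\psi}\bigl(\epsilon_\psi^{(t+1)} \mid \psi^{(t)}\bigr)}.$$

c. Set $\psi^{(t+1)} = \psi^*$ with probability equal to $\min\{1, R_\psi\}$. Otherwise
$\psi^{(t+1)} = \psi^{(t)}$.

6. Increment t and return to step 1.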


Applying this algorithm for a chain length of 45,000 with a burn-in of 10,000 shows
that the mixing properties of this chain are poor. For example, after burn-in the
proposal acceptance rates are 81, 27, and 68% for K, r, and $\psi$, respectively. It would be
more desirable to achieve acceptance rates near 44% (see Section 7.3.1.3).
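
A single-parameter update of the kind used in steps 3 to 5 can be sketched as follows. This is an illustrative sketch only: `toy_log_post` is a hypothetical stand-in for the log of the likelihood times the prior (it is not the whale model), the increment is drawn by simple rejection, and the function names are assumptions. The acceptance ratio includes the densities of the reverse and forward truncated increments because the proposal is not symmetric.

```python
import math
import random


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def log_increment_density(eps, current, sd, lower, upper):
    """Log density of eps ~ N(0, sd^2) truncated to lower <= current + eps <= upper."""
    mass = norm_cdf((upper - current) / sd) - norm_cdf((lower - current) / sd)
    return -0.5 * (eps / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi) * mass)


def sample_increment(current, sd, lower, upper, rng):
    """Draw a truncated normal increment by rejection sampling."""
    while True:
        eps = rng.gauss(0.0, sd)
        if lower <= current + eps <= upper:
            return eps


def mh_update(current, log_post, sd, lower, upper, rng):
    """One Metropolis-Hastings update with a truncated normal increment."""
    eps = sample_increment(current, sd, lower, upper, rng)
    proposal = current + eps
    # Target ratio times (reverse increment density) / (forward increment density).
    log_ratio = (log_post(proposal) - log_post(current)
                 + log_increment_density(-eps, proposal, sd, lower, upper)
                 - log_increment_density(eps, current, sd, lower, upper))
    accepted = rng.random() < math.exp(min(0.0, log_ratio))
    return (proposal if accepted else current), accepted


def toy_log_post(psi):
    # Hypothetical unnormalized log posterior on [0, 2]; NOT the whale model.
    return -0.5 * ((psi - 0.3) / 0.1) ** 2


rng = random.Random(1)
psi, draws, n_accept = 0.5, [], 0
for _ in range(5000):
    psi, accepted = mh_update(psi, toy_log_post, 0.1, 0.0, 2.0, rng)
    draws.append(psi)
    n_accept += accepted
accept_rate = n_accept / len(draws)
print(round(accept_rate, 2))
```

In a full hybrid Gibbs cycle, one such update would be applied to each of K, r, and $\psi$ in turn, each conditioning on the most recent values of the other two.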

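Now we try to improve MCMC performance using an adaptive approach. Redefine
the proposal distributions from (8.4)-(8.6) as follows:

$$g^{(t+1)}_{\epsilon_K}\bigl(\epsilon_K^{(t+1)} \mid K^{(t)}\bigr) \propto \phi\bigl(\epsilon_K^{(t+1)}; 0, 200\,\delta_K^{(t+1)}\bigr)\, I\bigl(\epsilon_K^{(t+1)} \in S_K\bigr), \tag{8.7}$$
$$g^{(t+1)}_{\epsilon_r}\bigl(\epsilon_r^{(t+1)} \mid r^{(t)}\bigr) \propto I\bigl(\epsilon_r^{(t+1)} \in S_r^{(t+1)}\bigr), \tag{8.8}$$
$$g^{(t+1)}_{\epsilon_\psi}\bigl(\epsilon_\psi^{(t+1)} \mid \psi^{(t)}\bigr) \propto \phi\bigl(\epsilon_\psi^{(t+1)}; 0, 0.1\,\delta_\psi^{(t+1)}\bigr)\, I\bigl(\epsilon_\psi^{(t+1)} \in S_\psi\bigr), \tag{8.9}$$

where

$$S_r^{(t+1)} = \bigl\{\epsilon_r^{(t+1)} : \max\{0.001,\ r^{(t)} - 0.03\,\delta_r^{(t+1)}\} \le r^{(t)} + \epsilon_r^{(t+1)} \le \min\{0.1,\ r^{(t)} + 0.03\,\delta_r^{(t+1)}\}\bigr\}.$$

Here, $\delta_K^{(t+1)}$, $\delta_r^{(t+1)}$, and $\delta_\psi^{(t+1)}$ are adaptation factors that vary as t increases. Thus,
these equations allow the standard deviations of the normally distributed increments
and the range of the uniformly distributed increment to decrease or increase over time.

For this example, we adjust $\delta_K^{(t+1)}$, $\delta_r^{(t+1)}$, and $\delta_\psi^{(t+1)}$ every 1500th iteration. The
expressions for rescaling the adaptation factors are

$$\log\bigl(\delta_K^{(t+1)}\bigr) = \log\bigl(\delta_K^{(t)}\bigr) + \frac{u_K^{(t+1)}}{(t+1)^{1/3}}, \tag{8.10}$$
$$\log\bigl(\delta_r^{(t+1)}\bigr) = \log\bigl(\delta_r^{(t)}\bigr) + \frac{u_r^{(t+1)}}{(t+1)^{1/3}}, \tag{8.11}$$
$$\log\bigl(\delta_\psi^{(t+1)}\bigr) = \log\bigl(\delta_\psi^{(t)}\bigr) + \frac{u_\psi^{(t+1)}}{(t+1)^{1/3}}, \tag{8.12}$$

where $u_K^{(t+1)}$, $u_r^{(t+1)}$, and $u_\psi^{(t+1)}$ are explained below. Thus by controlling $\{u_K, u_r, u_\psi\}$
we control how the proposal distributions adapt.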

Each adaptation depends on an acceptance rate, that is, the percentage of
iterations within a specified period for which the proposal is accepted. At a specific
adaptation step $t_a$ we set $u_\theta^{(t_a+1)} = -1$ if the acceptance rate for $\theta$ during the previous
1500 iterations was less than 44% and $u_\theta^{(t_a+1)} = 1$ otherwise, separately for the three
parameters indexed as $\theta \in \{K, r, \psi\}$. Thus before generating the $(t_a + 1)$th proposals
we observe the separate acceptance rates arising from steps 3c, 4c, and 5c above during
$\{t : t_a - 1500 < t \le t_a\}$. Then $u_K^{(t_a+1)}$, $u_r^{(t_a+1)}$, and $u_\psi^{(t_a+1)}$ are individually set to
reflect the algorithm performance for each of the parameters during that time period.
The u values may have different signs at any adaptation step, so the multiplicative
factors $\delta_K^{(t_a+1)}$, $\delta_r^{(t_a+1)}$, and $\delta_\psi^{(t_a+1)}$ will increase and decrease separately as simulations
progress.
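
The rescaling rule can be sketched in a few lines. This is a hedged illustration: the function name `update_delta` is hypothetical, and the exponent 1/3 on (t + 1) is passed as a parameter because it is an assumption about the form of the rescaling expressions rather than something the code can verify.

```python
import math


def update_delta(delta, accept_rate, t, target=0.44, power=1.0 / 3.0):
    """Rescale one adaptation factor on the log scale.

    u = -1 shrinks the proposal when acceptance is below the target rate;
    u = +1 widens it otherwise. The step size u / (t + 1)^power shrinks with t.
    """
    u = -1.0 if accept_rate < target else 1.0
    return delta * math.exp(u / (t + 1) ** power)


# Hypothetical first adaptation at t = 1500, using the nonadaptive acceptance rates.
delta_K = update_delta(1.0, 0.81, 1500)    # accepted too often: widen the proposal
delta_r = update_delta(1.0, 0.27, 1500)    # accepted too rarely: narrow the proposal
late_step = update_delta(1.0, 0.81, 42000)  # same signal, much later in the run
print(delta_K, delta_r, late_step)
```

Later adaptations move the factor by less than early ones, which is the behavior that the diminishing adaptation condition requires.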

Using this approach, the adaptive MCMC algorithm for this example will follow
the same six steps as above for advancing from t to t + 1 except that step 1 is replaced
with

1. If $t \in \{1500, 3000, \ldots, 42000\}$, then

a. Calculate the acceptance rates in the most recent 1500 iterations for each
parameter separately and determine $u_K^{(t+1)}$, $u_r^{(t+1)}$, and $u_\psi^{(t+1)}$.

b. Update $\delta_K^{(t+1)}$, $\delta_r^{(t+1)}$, and $\delta_\psi^{(t+1)}$ as in (8.10)-(8.12).

c. Let $g^{(t+1)}_{\epsilon_K}$, $g^{(t+1)}_{\epsilon_r}$, and $g^{(t+1)}_{\epsilon_\psi}$ be updated as in (8.7)-(8.9).

Otherwise, $g^{(t+1)}_{\epsilon_K}$, $g^{(t+1)}_{\epsilon_r}$, and $g^{(t+1)}_{\epsilon_\psi}$ remain unchanged from the previous
iteration.

The diminishing adaptation condition holds since $u^{(t+1)}/(t+1)^{1/3} \to 0$ in
(8.10)-(8.12). The bounded convergence condition holds because these adaptations
are restricted to a finite interval. Actually, we did not state this explicitly in our
presentation of the approach because in our example the adaptations settled down nicely
without the need to impose bounds.

Figure 8.2 shows how the acceptance rates for each parameter change as iterations
progress. In this figure, we see that the initial proposal distributions for K and $\psi$
are too concentrated, yielding insufficient exploration of the posterior distribution and
acceptance rates that are too high. In contrast, the original proposal distribution for r
is too broad, yielding acceptance rates that are too low. For all three parameters, as
iterations progress the proposal distributions are adjusted to provide acceptance rates
near 0.44. The evolution of these proposal distributions by means of adjusting $\delta_K^{(t)}$, $\delta_r^{(t)}$,
and $\delta_\psi^{(t)}$ is not monotonic and is subject to some Monte Carlo variation. Indeed, the
$u_K^{(t)}$, $u_r^{(t)}$, and $u_\psi^{(t)}$ change sign occasionally (but not necessarily simultaneously) as
acceptance rates vary randomly for each block of iterations between adaptation steps.
Nevertheless, the trends in $\delta_K^{(t)}$, $\delta_r^{(t)}$, and $\delta_\psi^{(t)}$ are in the correct directions.

FIGURE 8.2 Trends in acceptance rate (top row) and the adapting proposal dispersion
parameters (bottom row) for the whale dynamics model in Example 8.1. The range of the
horizontal axes is 0 to 45,000 and adaptations are made every 1500 iterations.

Table 8.1 compares some results from the nonadaptive and adaptive approaches. 
The results in Table 8.1 are compiled over only the last 7500 iterations of the sim- 
ulation. For each of the three parameters, the table provides the lag 10 correlation 
within each chain. For K and r, these correlations are extremely high because, as
noted above, the version of the model we use here forces K and r to lie within a very
narrow nonlinear band of the joint parameter space. This makes it very difficult for the 
chain using independent steps for K and r to travel along this narrow high posterior 
probability ridge. Therefore it is difficult for univariate sample paths of K and r to 
travel quickly around the full extent of their marginal posterior distributions. The table 
also reports the average squared jumping distance (ASJD) for each parameter. The 
ASJD is the sum of the squared distances between proposed values and chosen values 
weighted by the acceptance probabilities. The adaptive method provided increased 
ASJDs and decreased lag correlations when acceptance rates were initially too high, 
but decreased ASJDs and increased lag correlations when acceptances were initially 
too rare.

TABLE 8.1 Comparison of mixing behavior for standard and adaptive Metropolis within Gibbs
chains for Example 8.1. The 'Baseline' column shows the acceptance rate for the nonadaptive
approach. The ASJD columns report average squared jumping distance, as discussed in the text.

            Baseline        Lag 10 Correlation          ASJD
Parameter   Accept. Rate    Nonadaptive   Adaptive      Nonadaptive        Adaptive
K           81%             0.82          0.76          18,000             39,500
r           27%             0.74          0.81          1.97 × 10^{-5}     1.44 × 10^{-5}
ψ           68%             0.50          0.27          2.75 × 10^{-3}     4.20 × 10^{-3}
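
Diagnostics of this kind are straightforward to compute from a stored chain. The sketch below is an assumption-laden illustration: `lag_correlation` is the usual sample autocorrelation, and `mean_squared_jump` is the plain empirical average of squared moves between successive states, which is a common simplification and not necessarily the acceptance-weighted ASJD described in the text. The alternating chain is a hypothetical input used only to exercise the functions.

```python
def lag_correlation(chain, lag):
    """Sample autocorrelation of a scalar chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    denom = sum((x - mean) ** 2 for x in chain)
    numer = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return numer / denom


def mean_squared_jump(chain):
    """Average squared distance between successive states of a scalar chain."""
    jumps = [(b - a) ** 2 for a, b in zip(chain, chain[1:])]
    return sum(jumps) / len(jumps)


# Hypothetical chain that alternates between two values.
toy_chain = [0.0, 1.0] * 50
lag1 = lag_correlation(toy_chain, 1)
lag2 = lag_correlation(toy_chain, 2)
msj = mean_squared_jump(toy_chain)
print(lag1, lag2, msj)
```

Lower lag correlations and larger squared jumps both indicate a chain that moves more freely through the posterior.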


While the adaptive Metropolis-within-Gibbs algorithm is simple to understand 
and to apply, it ignores the correlations between the parameters. More sophisticated 
adaptive algorithms can have better convergence properties. The adaptive Metropolis 
algorithm incorporates this correlation into the adaptation. 


8.1.3 Adaptive Metropolis Algorithm 


In this chapter and the previous chapter, we have stressed that a good proposal distribu- 
tion produces candidate values that cover the support of the stationary distribution in 
a reasonable number of iterations and produces candidate values that are not accepted 
or rejected too frequently. The goal of the adaptive Metropolis algorithm is to estimate 
the variance of the proposal distribution during the algorithm, adapting it in pursuit 
of the optimal acceptance rate. In particular, the adaptive Metropolis algorithm is
a one-step random walk Metropolis algorithm (Section 7.1.2) with a normal proposal 
distribution whose variance is calibrated using previous iterations of the chain. 

Consider a normal random walk update for a p-dimensional X via the Metropolis
algorithm [12]. At each iteration of the chain, a candidate value X* is sampled
from a proposal distribution $N(X^{(t)}, \lambda\Sigma^{(t)})$. The goal is to adapt the covariance matrix
$\Sigma^{(t)}$ of the proposal distribution during the algorithm. For a p-dimensional spherical
multivariate normal target distribution where $\Sigma_\pi$ is the true covariance matrix of the
target distribution, the proposal covariance $(2.38^2/p)\Sigma_\pi$ has been shown to be optimal
with a corresponding acceptance rate of 44% when p = 1, which decreases to
23% as p increases [223]. Thus, in one version of the adaptive Metropolis algorithm,
$\lambda$ is set to $(2.38^2/p)$ [288]. Since $\Sigma_\pi$ is unknown, it is estimated based on previous
iterations of the chain. An adaptation parameter $\gamma^{(t)}$ is used to blend $\Sigma^{(t)}$ and $\Sigma^{(t+1)}$ in
such a way that the diminishing adaptation condition will be upheld. A parameter $\mu^{(t)}$
is also introduced and estimated adaptively. This is used to estimate the covariance
matrix $\Sigma^{(t)}$, since $\mathrm{var}\{X\} = E\{[X - \mu][X - \mu]^T\}$.

This adaptive Metropolis algorithm begins at t = 0 with the selection of $X^{(0)} = x^{(0)}$
drawn at random from some starting distribution g, with the requirement that
$f(x^{(0)}) > 0$ where f is the target distribution. Similarly, initialize $\mu^{(0)}$ and $\Sigma^{(0)}$;
common choices are $\mu^{(0)} = 0$ and $\Sigma^{(0)} = I$. Given $X^{(t)} = x^{(t)}$, $\mu^{(t)}$, and $\Sigma^{(t)}$, the
algorithm generates $X^{(t+1)}$ as follows:


1. Sample a candidate value X* from the proposal distribution $N(X^{(t)}, \lambda\Sigma^{(t)})$,
where $\lambda$ is set to $(2.38^2/p)$ for the basic implementation of the algorithm.

2. Select the value for $X^{(t+1)}$ according to

$$X^{(t+1)} = \begin{cases} X^* & \text{with probability } \min\{R(x^{(t)}, X^*),\, 1\},\\ x^{(t)} & \text{otherwise,} \end{cases}$$

where $R(x^{(t)}, X^*)$ is the Metropolis-Hastings ratio given in (7.1).

3. Adaptation step: Update the proposal distribution variance in two steps:

$$\mu^{(t+1)} = \mu^{(t)} + \gamma^{(t+1)}\bigl(X^{(t+1)} - \mu^{(t)}\bigr),$$
$$\Sigma^{(t+1)} = \Sigma^{(t)} + \gamma^{(t+1)}\Bigl[\bigl(X^{(t+1)} - \mu^{(t)}\bigr)\bigl(X^{(t+1)} - \mu^{(t)}\bigr)^T - \Sigma^{(t)}\Bigr].$$

Here $\gamma^{(t+1)}$ is an adaptation parameter with values chosen by the user. For
example, $\gamma^{(t+1)} = 1/(t + 1)$ is a reasonable choice.

4. Increment t and return to step 1.


The updating formula for $\Sigma^{(t)}$ is constructed so that it is computationally quick
to calculate and so that the adaptation diminishes as the number of iterations increases.
To uphold the diminishing adaptation condition, it is required that $\lim_{t\to\infty} \gamma^{(t)} = 0$.
The additional condition that $\sum_{t=0}^{\infty} \gamma^{(t)} = \infty$ allows the sequence $\Sigma^{(t)}$ to move an
infinite distance from its initial value [16].
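
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the target `toy_log_f` is hypothetical, a small `ridge` is added to the proposal covariance as a numerical safeguard, and the weights use $\gamma^{(t+1)} = 1/(t + 1 + \text{delay})$ with a hypothetical `delay` so that the first few updates do not collapse $\Sigma$ onto a single outer product. These weights still tend to zero and have a divergent sum.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def adaptive_metropolis(log_f, x0, n_iter, rng, delay=50, ridge=1e-8):
    p = len(x0)
    lam = 2.38 ** 2 / p                      # basic scaling from the text
    x = list(x0)
    mu = [0.0] * p                           # mu^(0) = 0
    sigma = [[float(i == j) for j in range(p)] for i in range(p)]  # Sigma^(0) = I
    chain = []
    for t in range(n_iter):
        # Step 1: candidate from N(x, lam * Sigma); the ridge is an added safeguard.
        cov = [[lam * sigma[i][j] + (ridge if i == j else 0.0) for j in range(p)]
               for i in range(p)]
        low = cholesky(cov)
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        cand = [x[i] + sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        # Step 2: the proposal is symmetric, so the ratio reduces to f(cand) / f(x).
        if rng.random() < math.exp(min(0.0, log_f(cand) - log_f(x))):
            x = cand
        # Step 3: update mean and covariance; deviations use the old mean mu^(t).
        gamma = 1.0 / (t + 1 + delay)        # assumed variant of 1 / (t + 1)
        dev = [x[i] - mu[i] for i in range(p)]
        mu = [mu[i] + gamma * dev[i] for i in range(p)]
        sigma = [[sigma[i][j] + gamma * (dev[i] * dev[j] - sigma[i][j])
                  for j in range(p)] for i in range(p)]
        chain.append(list(x))
    return chain, mu, sigma


def toy_log_f(x):
    # Hypothetical target: independent normals with variances 1 and 9.
    return -0.5 * (x[0] ** 2 + x[1] ** 2 / 9.0)


chain, mu, sigma = adaptive_metropolis(toy_log_f, [0.0, 0.0], 5000, random.Random(7))
print([round(sigma[i][i], 2) for i in range(2)])
```

The adapted matrix should come to reflect the differing scales of the two coordinates, which a fixed spherical proposal cannot do.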

Adapting both the mean and the variance of the proposal distribution in the 
adaptation step may be overkill, but it can have some advantages [12]. Specifically, 
the strategy can result in a more conservative sampler in the sense that the sampler may
resist large moves to poor regions of the parameter space. 

Several enhancements to the adaptive Metropolis algorithm may improve
performance. It may make sense to adapt $\lambda$ as well as adapt $\Sigma^{(t)}$ during the algorithm.
In this enhancement $\lambda$ is replaced by $\lambda^{(t)}$, and then $\lambda^{(t)}$ and $\Sigma^{(t)}$ are updated
independently. Specifically, in step 3 of the adaptive Metropolis algorithm $\lambda^{(t+1)}$ is also
updated using

$$\log\bigl(\lambda^{(t+1)}\bigr) = \log\bigl(\lambda^{(t)}\bigr) + \gamma^{(t+1)}\bigl(R(x^{(t)}, X^*) - a\bigr), \tag{8.13}$$

where a denotes the target acceptance rate (e.g., 0.234 for higher-dimensional
problems).
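
A sketch of this scaling update follows. The function name is hypothetical, and capping the ratio at 1 before comparing it with the target is an added assumption (it keeps a single very large ratio from inflating the scale); the update in (8.13) is written with the ratio itself.

```python
import math


def update_lambda(lam, gamma, mh_ratio, target=0.234, cap_at_one=True):
    """Update the proposal scale on the log scale, in the spirit of (8.13)."""
    # Assumption: use min(1, R), the acceptance probability, in place of R.
    signal = min(1.0, mh_ratio) if cap_at_one else mh_ratio
    return lam * math.exp(gamma * (signal - target))


lam0 = 2.38 ** 2 / 5                                       # basic scaling for a hypothetical p = 5
grown = update_lambda(lam0, gamma=0.01, mh_ratio=3.0)      # proposal was accepted easily
shrunk = update_lambda(lam0, gamma=0.01, mh_ratio=0.001)   # proposal was almost surely rejected
print(lam0, grown, shrunk)
```

Proposals that are accepted easily push the scale up, and proposals that are nearly always rejected push it down, so the acceptance rate drifts toward the target.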

One drawback with the adaptive Metropolis algorithm above is that all compo- 
nents are accepted or rejected simultaneously. This is not efficient when the problem is 
high dimensional (i.e., large p). An alternative is to develop a componentwise hybrid 
Gibbs adaptive Metropolis algorithm where each component is given its own scaling 
parameter $\lambda^{(t+1)}$ and is accepted/rejected separately in step 2. In this case, the constant
a in (8.13) is usually set to a higher value, for example, a = 0.44, since now com- 
ponents are updated individually (see Section 7.3.1.3). In addition, the components 
could be updated in random order; see random scan Gibbs sampling in Section 7.2.3.

Another variation is to carry out the adaptation in batches instead of
every iteration. With the batching strategy, the adaptation (step 3 in the adaptive
Metropolis algorithm) is only implemented at predetermined times $\{T_b\}$ for
b = 0, 1, 2, .... For example, the adaptation step could occur at a fixed interval,
such as $t \in \{50, 100, 150, \ldots\}$. Alternatively, the batching scheduling could be designed
with an increasing number of iterations between adaptations, for example,
$t \in \{50, 150, 300, 500, \ldots\}$. A batch approach was used in Example 8.1.

The batchwise adaptive Metropolis algorithm is described below. 


1. Initialization: Select starting values $X^{(0)} = x^{(0)}$ and set t = 0. Select a batching
schedule $\{T_b\}$ for b = 0, 1, 2, ... with $T_0 = 0$. Set a batch index b = 0. Select
starting values for the adaptation parameters $\mu^{(0)}$ and $\Sigma^{(0)}$; commonly used
starting values are $\mu^{(0)} = 0$ and $\Sigma^{(0)} = I$.

2. Sample candidate X* from the proposal distribution $N(x^{(t)}, \lambda\Sigma^{(b)})$.

3. Select a value for $X^{(t+1)}$ according to

$$X^{(t+1)} = \begin{cases} X^* & \text{with probability } \min\{R(x^{(t)}, X^*),\, 1\},\\ x^{(t)} & \text{otherwise,} \end{cases}$$

where $R(x^{(t)}, X^*)$ is given in (7.1).

4. When $t = T_{b+1}$, perform the adaptation steps:

a. Update the proposal distribution:

$$\mu^{(b+1)} = \mu^{(b)} + \frac{1}{T_{b+1} - T_b} \sum_{j=T_b+1}^{T_{b+1}} \bigl(X^{(j)} - \mu^{(b)}\bigr),$$
$$\Sigma^{(b+1)} = \Sigma^{(b)} + \frac{1}{T_{b+1} - T_b} \sum_{j=T_b+1}^{T_{b+1}} \Bigl[\bigl(X^{(j)} - \mu^{(b)}\bigr)\bigl(X^{(j)} - \mu^{(b)}\bigr)^T - \Sigma^{(b)}\Bigr].$$

b. Increment the batch index b = b + 1.

5. Increment t and return to step 2.


Note that in this algorithm, the diminishing adaptation condition is upheld when
$\lim_{b\to\infty}(T_{b+1} - T_b) = \infty$. It has also been suggested that adaptation times could be
selected randomly. For example, one could carry out the adaptation with probability
$p^{(t)}$ where $\lim_{t\to\infty} p^{(t)} = 0$ ensures diminishing adaptation [550].
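
A batching schedule with widening gaps is simple to generate. The sketch below, with hypothetical function and parameter names, reproduces the example schedule {50, 150, 300, 500, ...} by letting the gap between adaptations grow by a fixed step each time.

```python
def increasing_schedule(n_batches, step=50):
    """Adaptation times T_0 = 0 < T_1 < ... with gaps step, 2*step, 3*step, ..."""
    times, current = [0], 0
    for b in range(1, n_batches + 1):
        current += step * b
        times.append(current)
    return times


schedule = increasing_schedule(6)
gaps = [later - earlier for earlier, later in zip(schedule, schedule[1:])]
adaptation_times = set(schedule[1:])     # test "t in adaptation_times" inside the sampler
print(schedule, gaps)
```

Because the gaps grow without bound as more batches are added, a schedule of this form is consistent with the requirement that $T_{b+1} - T_b$ tends to infinity.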

The AMCMC algorithms described here all share the property that they are 
time inhomogeneous, that is, the proposal distributions change over time. Other time- 
inhomogeneous MCMC algorithms are described in Section 8.4. 

A wide variety of other AMCMC algorithms have been suggested. This area 
will likely continue to develop rapidly [12, 16, 288, 551].

8.2 REVERSIBLE JUMP MCMC


In Chapter 7 we considered MCMC methods for simulating $X^{(t)}$ for t = 1, 2, ... from
a Markov chain with stationary distribution f. The methods described in Chapter 7
required that the dimensionality of $X^{(t)}$ (i.e., of its state space) and the interpretation
of the elements of $X^{(t)}$ do not change with t. In many applications, it may be of
interest to develop a chain that allows for changes in the dimension of the parameter
space from one iteration to the next. Green's reversible jump Markov chain Monte
Carlo (RJMCMC) method permits transdimensional Markov chain Monte Carlo
simulation [278]. We discuss this approach below in the context of Bayesian model
uncertainty. The full generality of RJMCMC is described in many of the references
cited here.

Consider constructing a Markov chain to explore a space of candidate models,
each of which might be used to fit observed data y. Let $M_1, \ldots, M_K$ denote a
countable collection of models under consideration. A parameter vector $\theta_m$ denotes
the parameters in the mth model. Different models may have different numbers of
parameters, so we let $p_m$ denote the number of parameters in the mth model. In the
Bayesian paradigm, we may envision random variables $X = (M, \theta_M)$ which together
index the model and parameterize inference for that model. We may assign prior
distributions to these parameters, then seek to simulate from their posterior distribution
using an MCMC method for which the tth random draw is $X^{(t)} = \bigl(M^{(t)}, \theta^{(t)}_{M^{(t)}}\bigr)$.
Here $\theta^{(t)}_{M^{(t)}}$, which denotes the parameters drawn for the model indexed by $M^{(t)}$, has
dimension $p_{M^{(t)}}$ that can vary with t.

Thus, the goal of RJMCMC is to generate samples with joint posterior density
$f(m, \theta_m \mid y)$. This posterior arises from Bayes' theorem via

$$f(m, \theta_m \mid y) \propto f(y \mid m, \theta_m)\, f(\theta_m \mid m)\, f(m), \tag{8.14}$$

where $f(y \mid m, \theta_m)$ denotes the density of the observed data under the mth model
and its parameters, $f(\theta_m \mid m)$ denotes the prior density for the parameters in the mth
model, and $f(m)$ denotes the prior density of the mth model. A prior weight of $f(m)$
is assigned to the mth model so $\sum_{m=1}^{K} f(m) = 1$.

The posterior factorization

$$f(m, \theta_m \mid y) = f(m \mid y)\, f(\theta_m \mid m, y) \tag{8.15}$$

suggests two important types of inference. First, $f(m \mid y)$ can be interpreted as the
posterior probability for the mth model, normalized over all models under
consideration. Second, $f(\theta_m \mid m, y)$ is the posterior density of the parameters in the
mth model.

RJMCMC enables the construction of an appropriate Markov chain for X that
jumps between models with parameter spaces of different dimensions. Like simpler
MCMC methods, RJMCMC proceeds with the generation of a proposed step from the
current value $x^{(t)}$ to X*, and then a decision whether to accept the proposal or to keep
another copy of $x^{(t)}$. The stationary distribution for our chain will be the posterior in
(8.15) if the chain is constructed so that

$$f(m_1, \theta_{m_1} \mid y)\, a(m_2, \theta_{m_2} \mid m_1, \theta_{m_1}, y) = f(m_2, \theta_{m_2} \mid y)\, a(m_1, \theta_{m_1} \mid m_2, \theta_{m_2}, y)$$

for all $m_1$ and $m_2$, where $a(x_2 \mid x_1, y)$ denotes the density for the chain moving to
state $x_2 = (m_2, \theta_{m_2})$ at time t + 1, given that it was in state $x_1 = (m_1, \theta_{m_1})$ at time t.
Chains that meet this detailed balance condition are termed reversible because the
direction of time does not matter in the dynamics of the chain.

The key to the RJMCMC algorithm is the introduction of auxiliary random
variables at times t and t + 1 with dimensions chosen so that the augmented variables
(namely X and the auxiliary variables) at times t and t + 1 have equal dimensions. We
can then construct a Markov transition for the augmented variable at time t that
maintains dimensionality. This dimension-matching strategy enables the time-reversibility
condition to be met by using a suitable acceptance probability, thereby ensuring that
the Markov chain converges to the joint posterior for X. Details of the limiting theory
for these chains are given in [278, 279].

To understand dimension matching, it is simplest to begin by considering how
one might propose parameters $\theta_2$ corresponding to a proposed move from a model
$M_1$ with $p_1$ parameters to a model $M_2$ with $p_2$ parameters when $p_2 > p_1$. A simple
approach is to generate $\theta_2$ from an invertible deterministic function of both $\theta_1$ and
an independent random component $U_1$. We can write $\theta_2 = q_{1,2}(\theta_1, U_1)$. Proposing
parameters for the reverse move can be carried out via the inverse transformation,
$(\theta_1, U_1) = q_{1,2}^{-1}(\theta_2) = q_{2,1}(\theta_2)$. Note that $q_{2,1}$ is an entirely deterministic way to
propose $\theta_1$ from a given $\theta_2$.

Now generalize this idea to generate an augmented candidate parameter vector
($\theta^*_{M^*}$ and auxiliary variables U*), given a proposed move to M* from the current
model, $m^{(t)}$. We can apply an invertible deterministic function $q_{t,*}$ to $\theta^{(t)}$ and some
auxiliary random variables U to generate

$$(\theta^*_{M^*}, U^*) = q_{t,*}(\theta^{(t)}, U), \tag{8.16}$$

where U is generated from proposal density $h(\cdot \mid m^{(t)}, \theta^{(t)}, m^*)$. The auxiliary variables
U* and U are used so that $q_{t,*}$ maintains dimensionality during the Markov chain
transition at time t, but are discarded subsequently.

When $p_{M^*} = p_{M^{(t)}}$, the approach in (8.16) allows familiar proposal strategies.
For example, a random walk could be obtained using $(\theta^*_{M^*}, U^*) = (\theta^{(t)} + U, U)$ with
$U \sim N(0, \sigma^2 I)$ having dimension $p_{M^{(t)}}$. Alternatively, a Metropolis-Hastings chain
can be constructed by using $\theta^*_{M^*} = q_{t,*}(U)$ when $p_U = p_{M^*}$, for an appropriate
functional form of $q_{t,*}$ and suitable U. No U* would be required to equalize dimensions.
When $p_{M^{(t)}} < p_{M^*}$, the U can be used to expand parameter dimensionality; U* may
or may not be necessary to equalize dimensions, depending on the strategy employed.
When $p_{M^{(t)}} > p_{M^*}$, both U and U* may be unnecessary: for example, the simplest
dimension reduction is deterministically to reassign some elements of $\theta^{(t)}$ to U* and
retain the rest for $\theta^*_{M^*}$. In all these cases, the reverse proposal is again obtained from
the inverse of $q_{t,*}$.

Assume that the chain is currently visiting model $m^{(t)}$, so the chain is in the state
$x^{(t)} = \bigl(m^{(t)}, \theta^{(t)}_{m^{(t)}}\bigr)$. The next iteration of the RJMCMC algorithm can be summarized
as follows:

1. Sample a candidate model $M^* \mid m^{(t)}$ from a proposal density with conditional
density $g(\cdot \mid m^{(t)})$. The candidate model requires parameters $\theta_{M^*}$ of dimension
$p_{M^*}$.

2. Given M* = m*, generate an augmenting variable $U \mid \bigl(m^{(t)}, \theta^{(t)}_{m^{(t)}}, m^*\bigr)$ from a
proposal distribution with density $h\bigl(\cdot \mid m^{(t)}, \theta^{(t)}_{m^{(t)}}, m^*\bigr)$. Let

$$(\theta^*_{m^*}, U^*) = q_{t,*}\bigl(\theta^{(t)}_{m^{(t)}}, U\bigr),$$

where $q_{t,*}$ is an invertible mapping from $\bigl(\theta^{(t)}_{m^{(t)}}, U\bigr)$ to $(\theta^*_{m^*}, U^*)$ and the
auxiliary variables have dimensions satisfying $p_{m^{(t)}} + p_U = p_{m^*} + p_{U^*}$.

3. For a proposed model, M* = m*, and the corresponding proposed parameter
values $\theta^*_{m^*}$, compute the Metropolis-Hastings ratio given by

$$\frac{f\bigl(m^*, \theta^*_{m^*} \mid y\bigr)\, g\bigl(m^{(t)} \mid m^*\bigr)\, h\bigl(u^* \mid m^*, \theta^*_{m^*}, m^{(t)}\bigr)}{f\bigl(m^{(t)}, \theta^{(t)}_{m^{(t)}} \mid y\bigr)\, g\bigl(m^* \mid m^{(t)}\bigr)\, h\bigl(u \mid m^{(t)}, \theta^{(t)}_{m^{(t)}}, m^*\bigr)}\, |J(t)|, \tag{8.17}$$

where J(t) is the Jacobian matrix described in Section 1.1,

$$J(t) = \left.\frac{dq_{t,*}(\theta, u)}{d(\theta, u)}\right|_{(\theta, u) = \bigl(\theta^{(t)}_{m^{(t)}},\, U\bigr)}. \tag{8.18}$$

Accept the move to the model M* with probability equal to the minimum
of 1 and the expression in (8.17). If the proposal is accepted, set $X^{(t+1)} =
(M^*, \theta^*_{M^*})$. Otherwise, reject the candidate draw and set $X^{(t+1)} = x^{(t)}$.

4. Discard U and U*. Return to step 1.

The last term in (8.17) is the absolute value of the determinant of the Jacobian matrix
arising from the change of variables from $\bigl(\theta^{(t)}_{m^{(t)}}, U\bigr)$ to $(\theta^*_{m^*}, U^*)$. If $p_{M^{(t)}} = p_{M^*}$,
then (8.17) simplifies to the standard Metropolis-Hastings ratio (7.1). Note that it is
implicitly assumed here that the transformation $q_{t,*}$ is differentiable.


Example 8.2 (Jumping between Two Simple Models) An elementary example
illustrates some of the details described above [278]. Consider a problem with K = 2
possible models: The model $M_1$ has a one-dimensional parameter space $\theta_1 = \alpha$, and
the model $M_2$ has a two-dimensional parameter space $\theta_2 = (\beta, \gamma)$. Thus $p_1 = 1$ and
$p_2 = 2$. Let $m_1 = 1$ and $m_2 = 2$.

If the chain is currently in state $(1, \theta_1)$ and the model $M_2$ is proposed, then a
random variable $U \sim h(u \mid 1, \theta_1, 2)$ is generated from some proposal density h. Let
$\beta = \alpha - U$ and $\gamma = \alpha + U$, so $q_{1,2}(\alpha, u) = (\alpha - u, \alpha + u)$ and

$$|dq_{1,2}(\alpha, u)/d(\alpha, u)| = 2.$$

If the chain is currently in state $(2, \theta_2)$ and $M_1$ is proposed, then $(\alpha, u) =
q_{2,1}(\beta, \gamma) = \bigl(\tfrac{\beta + \gamma}{2}, \tfrac{\gamma - \beta}{2}\bigr)$ is the inverse mapping. Therefore

$$|dq_{2,1}(\beta, \gamma)/d(\beta, \gamma)| = \frac{1}{2},$$

and U* is not required to match dimensions. This transition is entirely deterministic,
so we replace $h(u^* \mid 2, \theta_2, 1)$ in (8.17) with 1.

Thus for a proposed move from $M_1$ to $M_2$, the Metropolis-Hastings ratio (8.17)
is equal to

$$\frac{f(2, \beta, \gamma \mid y)\, g(1 \mid 2)}{f(1, \alpha \mid y)\, g(2 \mid 1)\, h(u \mid 1, \theta_1, 2)} \times 2. \tag{8.19}$$

The Metropolis-Hastings ratio equals the reciprocal of (8.19) for a proposed move
from $M_2$ to $M_1$.
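
The dimension-matching maps in this example can be checked numerically. The sketch below uses hypothetical function names and a central-difference approximation to the Jacobian; `log_ratio_1_to_2` assembles the log of the ratio for a move from the one-parameter model to the two-parameter model, with every density supplied by the caller.

```python
import math


def q12(alpha, u):
    """Map (alpha, u) to (beta, gamma) = (alpha - u, alpha + u)."""
    return alpha - u, alpha + u


def q21(beta, gamma):
    """Inverse map: recover (alpha, u) from (beta, gamma)."""
    return 0.5 * (beta + gamma), 0.5 * (gamma - beta)


def abs_jacobian_det(q, a, b, h=1e-6):
    """Absolute determinant of the 2 x 2 Jacobian of q at (a, b), by central differences."""
    da = [(hi - lo) / (2 * h) for hi, lo in zip(q(a + h, b), q(a - h, b))]
    db = [(hi - lo) / (2 * h) for hi, lo in zip(q(a, b + h), q(a, b - h))]
    return abs(da[0] * db[1] - db[0] * da[1])


def log_ratio_1_to_2(log_f1, log_f2, alpha, u, log_h_u, log_g_1_given_2, log_g_2_given_1):
    """Log Metropolis-Hastings ratio for a proposed move from model 1 to model 2."""
    beta, gamma = q12(alpha, u)
    return (log_f2(beta, gamma) + log_g_1_given_2
            - log_f1(alpha) - log_g_2_given_1 - log_h_u + math.log(2.0))


det_up = abs_jacobian_det(q12, 1.3, 0.4)      # arbitrary evaluation point
det_down = abs_jacobian_det(q21, 0.9, 1.7)
round_trip = q21(*q12(1.3, 0.4))
print(det_up, det_down, round_trip)
```

The two determinants are reciprocals of each other, which is why the ratio for the reverse move is the reciprocal of the ratio for the forward move.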


There are several significant challenges to implementing RJMCMC. Since the
number of dimensions can be enormous, it can be critical to select an appropriate
proposal distribution h and to construct efficient moves between the different dimensions
of the model space. Another challenge is the diagnosis of convergence for RJMCMC
algorithms. Research in these areas is ongoing [72-74, 427, 604].

RJMCMC is a very general method, and reversible jump methods have been
developed for a myriad of application areas, including model selection and parameter
estimation for linear regression [148], variable and link selection for generalized linear
models [487], selection of the number of components in a mixture distribution [74,
536, 570], knot selection and other applications in nonparametric regression [48, 162,
334], and model determination for graphical models [147, 248]. There are many other
areas for potential application of RJMCMC. Genetic mapping was an early area of
exploration for RJMCMC implementation [122, 645, 648]; there are claims that up
to 20% of citations related to RJMCMC involve genetics applications [603].

RJMCMC unifies earlier MCMC methods to compare models with different 
numbers of parameters. For example, earlier methods for Bayesian model selection 
and model averaging for linear regression analysis, such as stochastic search variable 
selection [229] and MCMC model composition [527], can be shown to be special 
cases of RJMCMC [119]. 


8.2.1 RJMCMC for Variable Selection in Regression 


Consider a multiple linear regression problem with p potential predictor variables
in addition to the intercept. A fundamental problem in regression is selection of a
suitable model. Let $m_k$ denote model k, which is defined by its inclusion of the $i_1$th
through $i_d$th predictors, where the indices $\{i_1, \ldots, i_d\}$ are a subset of $\{1, \ldots, p\}$. For
the consideration of all possible subsets of p predictor variables, there are therefore
$K = 2^p$ models under consideration. Using standard regression notation, let Y denote
the vector of n independent responses. For any model $m_k$, arrange the corresponding
predictors in a design matrix $X_{m_k} = (1 \;\; x_{i_1} \; \cdots \; x_{i_d})$, where $x_{i_j}$ is the n-vector of
observations of the $i_j$th predictor. The predictor data are assumed fixed. We seek the
best ordinary least squares model of the form

$$Y = X_{m_k} \beta_{m_k} + \epsilon \tag{8.20}$$

among all $m_k$, where $\beta_{m_k}$ is a parameter vector corresponding to the design matrix
for $m_k$ and the error variance is $\sigma^2$. In the remainder of this section, conditioning on
the predictor data is assumed.

The notion of what model is best may have any of several meanings. In
Example 3.2 the goal was to use the Akaike information criterion (AIC) to select the best
model [7, 86]. Here, we adopt a Bayesian approach for variable selection with priors
on the regression coefficients and $\sigma^2$, and with the prior for the coefficients depending
on $\sigma^2$. The immediate goal is to select the most promising subset of predictor
variables, but we will also show how the output of an RJMCMC algorithm can be used to
estimate a whole array of quantities of interest such as posterior model probabilities,
the posterior distribution of the parameters of each model under consideration, and
model-averaged estimates of various quantities of interest.

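In our RJMCMC implementation, based on [119, 527], each iteration begins
at a model $m^{(t)}$, which is described by a specific subset of predictor variables. To
advance one iteration, the model M* is proposed from among those models having
either one predictor variable more or one predictor variable fewer than the current
model. Thus the model proposal distribution is given by $g(\cdot \mid m^{(t)})$, where

$$g\bigl(m^* \mid m^{(t)}\bigr) = \begin{cases} \dfrac{1}{p} & \text{if } M^* \text{ has one more or one fewer predictor than } m^{(t)},\\[4pt] 0 & \text{otherwise.} \end{cases}$$

Given a proposed model M* = m*, step 2 of the RJMCMC algorithm requires us to
sample $U \mid \bigl(m^{(t)}, \beta^{(t)}_{m^{(t)}}, m^*\bigr) \sim h\bigl(\cdot \mid m^{(t)}, \beta^{(t)}_{m^{(t)}}, m^*\bigr)$. A simplifying approach is to let U
become the next value for the parameter vector, in which case we may set the proposal
distribution h equal to the posterior for $\beta_m \mid (m, y)$, namely $f(\beta_m \mid m, y)$. For appropriate
conjugate priors, $\beta^*_{m^*} \mid (m^*, y)$ has a noncentral t distribution [58]. We draw U from
this proposal and set $\beta^*_{m^*} = U$ and $U^* = \beta^{(t)}_{m^{(t)}}$. Thus $q_{t,*}\bigl(\beta^{(t)}_{m^{(t)}}, U\bigr) = (\beta^*_{m^*}, U^*)$,
yielding a Jacobian of 1. Since $g\bigl(m^{(t)} \mid m^*\bigr) = g\bigl(m^* \mid m^{(t)}\bigr) = 1/p$, the
Metropolis-Hastings ratio in (8.17) can be written as

$$\frac{f\bigl(y \mid m^*, \beta^*_{m^*}\bigr)\, f\bigl(\beta^*_{m^*} \mid m^*\bigr)\, f(m^*)\, f\bigl(\beta^{(t)}_{m^{(t)}} \mid m^{(t)}, y\bigr)}{f\bigl(y \mid m^{(t)}, \beta^{(t)}_{m^{(t)}}\bigr)\, f\bigl(\beta^{(t)}_{m^{(t)}} \mid m^{(t)}\bigr)\, f\bigl(m^{(t)}\bigr)\, f\bigl(\beta^*_{m^*} \mid m^*, y\bigr)} = \frac{f(y \mid m^*)\, f(m^*)}{f\bigl(y \mid m^{(t)}\bigr)\, f\bigl(m^{(t)}\bigr)} \tag{8.21}$$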
after simplification. Here $f(y \mid m^*)$ is the marginal likelihood, and $f(m^*)$ is the prior
density, for the model m*. Observe that this ratio does not depend on $\beta^*_{m^*}$ or $\beta^{(t)}_{m^{(t)}}$.
Therefore, when implementing this approach with conjugate priors, one can treat
the proposal and acceptance of $\beta$ as a purely conceptual construct useful for placing
the algorithm in the RJMCMC framework. In other words, we don't need to
simulate $\beta^{(t)} \mid m^{(t)}$, because $f(\beta \mid m, y)$ is available in closed form. The posterior model
probabilities and $f(\beta \mid m, y)$ fully determine the joint posterior.

TABLE 8.2 RJMCMC model selection results for baseball example: the five models
with the highest posterior model probability (PMP). The bullets indicate inclusion of
the corresponding predictor in the given model, using the predictor indices given in
Table 3.2.

Predictors
3 4 8 10 13 14 24 PMP 
e e ° ° e 0.22 
° ° ° ° e 0.08 
° ° ° e 0.05 
e e ° ° ° ° 0.04 
e e e e e ° 0.03 
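
Because the acceptance ratio for the variable selection sampler depends only on marginal likelihoods and model priors, the sampler can be sketched as a walk over inclusion indicators. The code below is an illustration under stated assumptions: a uniform prior over models, a caller-supplied `log_marginal` function, and a hypothetical toy score (`WEIGHTS`) standing in for the closed-form marginal likelihood, which is not derived here. The simplification relies on the identity $f(y \mid m, \beta) f(\beta \mid m) / f(\beta \mid m, y) = f(y \mid m)$.

```python
import itertools
import math
import random
from collections import Counter


def model_space_sampler(log_marginal, p, n_iter, rng):
    """Walk over models, each encoded as a tuple of p inclusion indicators."""
    model = (0,) * p                 # start from the intercept-only model
    visits = Counter()
    for _ in range(n_iter):
        j = rng.randrange(p)         # g(m* | m) = 1/p: add or drop predictor j
        proposal = model[:j] + (1 - model[j],) + model[j + 1:]
        # Uniform model prior assumed, so the ratio is f(y | m*) / f(y | m).
        log_ratio = log_marginal(proposal) - log_marginal(model)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            model = proposal
        visits[model] += 1
    return visits


WEIGHTS = (2.0, -1.0, 0.5)           # hypothetical, for illustration only


def toy_log_marginal(model):
    # Stand-in for log f(y | m); a real application would use the closed form.
    return sum(w * z for w, z in zip(WEIGHTS, model))


n_iter = 40000
visits = model_space_sampler(toy_log_marginal, 3, n_iter, random.Random(0))
estimated = {m: c / n_iter for m, c in visits.items()}

# Exact posterior model probabilities by enumerating all 2^3 models.
models = list(itertools.product((0, 1), repeat=3))
total = sum(math.exp(toy_log_marginal(m)) for m in models)
exact = {m: math.exp(toy_log_marginal(m)) / total for m in models}
best = max(exact, key=exact.get)
print(best, round(exact[best], 3), round(estimated.get(best, 0.0), 3))
```

With only eight models the visit frequencies can be compared against exact probabilities; with 27 candidate predictors enumeration is infeasible and the sampler is the practical route.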


After running the RJMCMC algorithm, inference about many quantities of
interest is possible. For example, from (8.15) the posterior model probabilities $f(m_k \mid y)$
can be approximated by the ratio of the number of times the chain visited the kth
model to the number of iterations of the chain. These estimated posterior model
probabilities can be used to select models. In addition, the output from the RJMCMC
algorithm can be used to implement Bayesian model averaging. For example, if u is
some quantity of interest such as a future observable, the utility of a course of action,
or an effect size, then the posterior distribution of u given the data is given by

$$f(u \mid y) = \sum_{k=1}^{K} f(u \mid m_k, y)\, f(m_k \mid y). \tag{8.22}$$

This is the average of the posterior distribution for u under each model, weighted by
the posterior model probability. It has been shown that taking account of uncertainty
about the form of the model can protect against underestimation of uncertainty [331].
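
Both calculations reduce to simple bookkeeping on the sampler output. In the sketch below the visited-model list and the per-model posterior means are hypothetical values, and the model-averaged mean uses the fact that taking expectations on both sides of (8.22) gives a weighted average of the per-model posterior means.

```python
from collections import Counter


def posterior_model_probs(visited):
    """Estimate f(m_k | y) by the fraction of iterations spent in each model."""
    counts = Counter(visited)
    n = len(visited)
    return {model: count / n for model, count in counts.items()}


def model_averaged_mean(pmp, per_model_mean):
    """Posterior mean of a quantity, averaged over models as in (8.22)."""
    return sum(prob * per_model_mean[model] for model, prob in pmp.items())


# Hypothetical chain output: labels of the model visited at each iteration.
visited = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
# Hypothetical posterior means of the quantity of interest under each model.
per_model_mean = {"A": 1.0, "B": 2.0, "C": 4.0}

pmp = posterior_model_probs(visited)
averaged = model_averaged_mean(pmp, per_model_mean)
print(pmp, averaged)
```

The model-averaged estimate lies between the per-model estimates and leans toward those from models the chain visited most often.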


Example 8.3 (Baseball Salaries, Continued) Recall Example 3.3, where we 
sought the best subset among 27 possible predictors to use in linear regression mod- 
eling of baseball players’ salaries. Previously, the objective was to find the best subset 
as measured by the minimal AIC value. Here, we seek the best subset as measured 
by the model with the highest posterior model probability. 

We adopt a uniform prior over model space, assigning f(m_k) = 2^(−p) for each
model. For the remaining parameters, we use a normal-gamma conjugate class of priors
with β_{m_k} | m_k ~ N(α_{m_k}, σ² V_{m_k}) and νλ/σ² ~ χ²_ν. For this construction,
f(y | m_k) in (8.21) can be shown to be the noncentral t density (Problem 8.1). For the
baseball data, the hyperparameters are set as follows. First, let ν = 2.58 and λ = 0.28.
Next, α_{m_k} = (β̂_0, 0, ..., 0) is a vector of length p_{m_k} whose first element equals
the least squares estimated intercept from the full model. Finally, V_{m_k} is a diagonal
matrix with entries (s_y², c²/s_1², ..., c²/s_p²), where s_y² is the sample variance of y,
s_i² is the sample variance of the ith predictor, and c = 2.58. Additional details are
given in [527].

We ran 200,000 RJMCMC iterations. Table 8.2 shows the five models with the 
highest estimated posterior model probabilities. If the goal is to select the best model, 
then the model with the predictors 3, 8, 10, 13, and 14 should be chosen, where the 
predictor indices correspond to those in Table 3.2.

TABLE 8.3 RJMCMC results for baseball example: the estimated posterior effect
probabilities P(β_i ≠ 0 | y) exceeding 0.10. The predictor indices and labels
correspond to those given in Table 3.2.

Index   Predictor       P(β_i ≠ 0 | y)
13      FA              1.00
14      Arb             1.00
8       RBIs            0.97
10      SOs             0.78
3       Runs            0.55
        Hits            0.52
25      SBs × OBP       0.13
24      SOs × errors    0.12
9       Walks           0.11


The posterior effect probabilities, P(β_i ≠ 0 | y), for those predictors with
probabilities greater than 0.10 are given in Table 8.3. Each entry is a weighted average of
an indicator variable that equals 1 only when the coefficient is in the model, where the 
weights correspond to the posterior model probabilities as in Equation (8.22). These 
results indicate that free agency, arbitration status, and the number of runs batted in 
are strongly associated with baseball players’ salaries. 
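
When the posterior model probabilities are estimated by visit frequencies, the weighted average of inclusion indicators reduces to the fraction of iterations in which a predictor was in the visited model. The sketch below illustrates this under that assumption; the `visits` list and the function name are hypothetical, not output from the baseball analysis.

```python
def effect_probabilities(visits):
    """Estimate P(beta_i != 0 | y) as the share of iterations whose model includes i."""
    n = len(visits)
    counts = {}
    for model in visits:
        for i in model:
            counts[i] = counts.get(i, 0) + 1
    return {i: c / n for i, c in sorted(counts.items())}


# Hypothetical chain output: the set of included predictor indices at each iteration.
visits = ([frozenset({13, 14})] * 5
          + [frozenset({8, 13, 14})] * 3
          + [frozenset({3, 8, 13, 14})] * 2)

pep = effect_probabilities(visits)
```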

Other quantities of interest based on variants of Equation (8.22) can be com- 
puted, such as the model-averaged posterior expectation and variance for each regres- 
sion parameter, or various posterior salary predictions. 


Alternative approaches to transdimensional Markov chain simulation have been 
proposed. One method is based on the construction of a continuous-time Markov 
birth-and-death process [89, 613]. In this approach, the parameters are modeled via 
a point process. A general form of RJMCMC has been proposed that unifies many of
the existing methods for assessing uncertainty about the dimension of the parameter 
space [261]. Continued useful development in these areas is likely [318, 428, 603]. 
One area of promise is to combine RJMCMC and AMCMC. 


8.3 AUXILIARY VARIABLE METHODS 


An important area of development in MCMC methods concerns auxiliary variable 
strategies. In many cases, such as Bayesian spatial lattice models, standard MCMC 
methods can take too long to mix properly to be of practical use. In such cases, one 
potential remedy is to augment the state space of the variable of interest. This approach 
can lead to chains that mix faster and require less tuning than the standard MCMC 
methods described in Chapter 7. 

Continuing with the notation introduced in Chapter 7, we let X denote a random
variable on whose state space we will simulate a Markov chain, usually for the
purpose of estimating the expectation of a function of X ~ f(x). In Bayesian
applications, it is important to remember that the random variables X^(t) simulated in
an MCMC procedure are typically parameter vectors whose posterior distribution
is of primary interest. Consider a target distribution f which can be evaluated but
not easily sampled. To construct an auxiliary variable algorithm, the state space of
X is augmented by the state space of a vector of auxiliary variables, U. Then one
constructs a Markov chain over the joint state space of (X, U) having stationary
distribution (X, U) ~ f(x, u) that marginalizes to the target f(x). When simulation has
been completed, inference is based only on the marginal distribution of X. For example,
a Monte Carlo estimator of μ = ∫ h(x) f(x) dx is μ̂ = (1/n) Σ_{t=1}^{n} h(X^(t)), where
the (X^(t), U^(t)) are simulated in the augmented chain, but the U^(t) are discarded.

Auxiliary variable MCMC methods were introduced in the statistical physics 
literature [174, 621]. Besag and Green noted the potential usefulness of this strategy, 
and a variety of refinements have subsequently been developed [41, 132, 328]. Aug- 
menting the variables of interest to solve challenging statistical problems is effective in 
other areas, such as the EM algorithm described in Chapter 4 and the reversible jump 
algorithms described in Section 8.2. The links between EM and auxiliary variable 
methods for MCMC algorithms are further explored in [640]. 

Below we describe simulated tempering as an example of an auxiliary variable 
strategy. Another important example is slice sampling, which is discussed in the 
next subsection. In Section 8.7.2 we present another application of auxiliary variable 
methods for the analysis of spatial or image data. 


8.3.1 Simulated Tempering 


In problems with high numbers of dimensions, multimodality, or slow MCMC mixing, 
extremely long chains may be required to obtain reliable estimates of the quantities of 
interest. The approach of simulated tempering provides a potential remedy [235, 438]. 
Simulations are based on a sequence of unnormalized densities f_i for i = 1, ..., m,
on a common sample space. These densities are viewed as ranging from cold (i = 1)
to hot (i = m). Typically only the cold density is desired for inference, with the other
densities being exploited to improve mixing. Indeed, the warmer densities should be
designed so that MCMC mixing is faster for them than it is for f_1.

Consider the augmented variable (X, I), where the temperature I is now viewed
as random with prior I ~ p(i). From a starting value, (x^(0), i^(0)), we may construct a
Metropolis-Hastings sampler in the augmented space as follows:


1. Use a Metropolis-Hastings or Gibbs update to draw X^(t+1) | i^(t) from a chain
with stationary distribution f_{i^(t)}.

2. Generate I* from a proposal density, g(· | i^(t)). A simple option is

    g(i^* \mid i^{(t)}) =
      \begin{cases}
        1   & \text{if } (i^{(t)}, i^*) = (1, 2) \text{ or } (i^{(t)}, i^*) = (m, m-1), \\
        1/2 & \text{if } |i^* - i^{(t)}| = 1 \text{ and } i^{(t)} \in \{2, \ldots, m-1\}, \\
        0   & \text{otherwise.}
      \end{cases}

3. Accept or reject the candidate I* as follows. Define the Metropolis-Hastings
ratio to be R_ST(i^(t), I*, X^(t+1)), where

    R_{ST}(u, v, z) = \frac{f_v(z) \, p(v) \, g(u \mid v)}{f_u(z) \, p(u) \, g(v \mid u)},    (8.23)

and accept I^(t+1) = I* with probability min{R_ST(i^(t), I*, X^(t+1)), 1}. Otherwise,
keep another copy of the current state, setting I^(t+1) = i^(t).

4. Return to step 1.


The simplest way to estimate an expectation under the cold distribution is to average
realizations generated from it, throwing away realizations generated from other f_i.
To use more of the data, note that a state (x, i) drawn from the stationary distribution
of the augmented chain has density proportional to f_i(x) p(i). Therefore, importance
weights w*(x) = f(x)/[f_i(x) p(i)] can be used to estimate expectations with respect
to a target density f; see Chapter 6.
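
The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the bimodal cold target, the temperature ladder `TEMPS`, the tempered densities f_i = f_1^(1/T_i), the uniform prior p(i), and the random walk step size are all hypothetical choices. Temperature levels are indexed from 0 in the code, so level 0 is the cold density, and only draws made at that level are kept (the simplest estimator described above).

```python
import math
import random

TEMPS = [1.0, 2.0, 4.0, 8.0]  # hypothetical temperature ladder; index 0 is cold


def log_f_cold(x):
    """Unnormalized cold log density: equal mixture of N(-4, 1) and N(4, 1)."""
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))


def log_f(i, x):
    """Tempered log density, assuming f_i = f_1 ** (1 / T_i)."""
    return log_f_cold(x) / TEMPS[i]


def g(i_to, i_from):
    """Proposal probability g(i_to | i_from) from step 2 (levels are 0-based)."""
    if abs(i_to - i_from) != 1:
        return 0.0
    return 1.0 if i_from in (0, len(TEMPS) - 1) else 0.5


def simulated_tempering(n_iter, rng, step=2.0):
    m = len(TEMPS)
    log_p = [0.0] * m  # uniform p(i); an untuned assumption
    x, i = 0.0, 0
    cold_draws = []
    for _ in range(n_iter):
        # Step 1: random walk Metropolis update of x under the current f_i.
        cand = x + rng.gauss(0.0, step)
        if rng.random() < math.exp(min(0.0, log_f(i, cand) - log_f(i, x))):
            x = cand
        # Step 2: propose a neighboring temperature level.
        if i == 0:
            i_star = 1
        elif i == m - 1:
            i_star = m - 2
        else:
            i_star = i + rng.choice((-1, 1))
        # Step 3: accept or reject using the ratio (8.23), on the log scale.
        log_r = (log_f(i_star, x) + log_p[i_star] + math.log(g(i, i_star))
                 - log_f(i, x) - log_p[i] - math.log(g(i_star, i)))
        if rng.random() < math.exp(min(0.0, log_r)):
            i = i_star
        if i == 0:
            cold_draws.append(x)  # keep only draws made at the cold level
    return cold_draws


cold = simulated_tempering(20000, random.Random(1))
```

With a uniform p(i) the hotter levels are visited more often than the cold one, because their unnormalized densities integrate to larger values; adjusting p(i) to even out the visits is the tuning issue discussed next.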

The prior p is set by the user and ideally should be chosen so that the m tem- 
pering distributions (i.e., the m states for i) are visited roughly equally. In order for 
all the tempering distributions to be visited in a tolerable running time, m must be 
fairly small. On the other hand, each pair of adjacent tempering distributions must 
have sufficient overlap for the augmented chain to move easily between them. This 
requires a large m. To balance these competing concerns, choices for m that provide 
acceptance rates roughly in the range suggested in Section 7.3.1.3 are recommended. 
Improvements, extensions, and related techniques are discussed in [232, 235, 339, 
417, 480]. Relationships between simulated tempering and other MCMC and impor- 
tance sampling methods are explored in [433, 682]. 

Simulated tempering is reminiscent of the simulated annealing optimization 
algorithm described in Chapter 3. Suppose we run simulated tempering on the state
space for θ. Let L(θ) and q(θ) be a likelihood and prior for θ, respectively. If we let
f_i(θ) = exp{(1/τ_i) log{q(θ)L(θ)}} for τ_i = i and i = 1, 2, ..., then i = 1 makes the
cold distribution match the posterior for θ, and i > 1 generates heated distributions
that are increasingly flattened to improve mixing. Equation (8.23) then evokes step 2
of the simulated annealing algorithm described in Section 3.3 to minimize the negative 
log posterior. We have previously noted that simulated annealing produces a time- 
inhomogeneous Markov chain in its quest to find an optimum (Section 3.3.1.2). The 
output of simulated tempering is also a Markov chain, but simulated tempering does 
not systematically cool in the same sense as simulated annealing. The two procedures 
share the idea of using warmer distributions to facilitate exploration of the state space. 


8.3.2 Slice Sampler 


An important auxiliary variable MCMC technique is called the slice sampler [132, 
328, 481]. Consider MCMC for a univariate variable X ~ f(x), and suppose that it is 
impossible to sample directly from f. Introducing any univariate auxiliary variable U 
would allow us to consider a target density for (X, U) ~ f(x, u). Writing f(x, u) =
f(x) f(u|x) suggests an auxiliary variable Gibbs sampling strategy that alternates
between updates for X and U [328]. The trick is to choose a U variable that speeds
MCMC mixing for X. At iteration t + 1 of the slice sampler we alternately generate
X^(t+1) and U^(t+1) according to

    U^{(t+1)} \mid x^{(t)} \sim \text{Unif}\left(0, f(x^{(t)})\right),    (8.24)

    X^{(t+1)} \mid u^{(t+1)} \sim \text{Unif}\left\{x : f(x) \ge u^{(t+1)}\right\}.    (8.25)

FIGURE 8.3 Two steps of a univariate slice sampler for target distribution f.


Figure 8.3 illustrates the approach. At iteration t + 1, the algorithm starts at
x^(t), shown in the upper panel. Then U^(t+1) is drawn from Unif(0, f(x^(t))). In
the top panel this corresponds to sampling along the vertical shaded strip. Now
X^(t+1) | (U^(t+1) = u^(t+1)) is drawn uniformly from the set of x values for which
f(x) ≥ u^(t+1). In the lower panel this corresponds to sampling along the horizontal
shaded strip.

While simulating from (8.25) is straightforward for this example, in other settings
the set {x : f(x) ≥ u^(t+1)} may be more complicated. In particular, sampling
X^(t+1) | (U^(t+1) = u^(t+1)) in (8.25) can be challenging if f is not invertible. One
way to implement Equation (8.25) is to use rejection sampling;
see Section 6.2.3.
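
A minimal sketch of the updates (8.24) and (8.25), using rejection to draw from the slice, is given below. It assumes the target is restricted to a known bounded interval [lo, hi], so that uniform candidates on that interval can be accepted whenever they fall in the slice. The bimodal density `f`, the interval, and the function names are hypothetical choices for illustration.

```python
import math
import random


def f(x):
    """Unnormalized target: equal mixture of N(-2, 1) and N(2, 1)."""
    return math.exp(-0.5 * (x + 2.0) ** 2) + math.exp(-0.5 * (x - 2.0) ** 2)


def slice_sampler(n_iter, rng, x0=0.0, lo=-10.0, hi=10.0):
    """Univariate slice sampler for f restricted to [lo, hi] (an assumed bound)."""
    x = x0
    draws = []
    for _ in range(n_iter):
        u = rng.uniform(0.0, f(x))      # (8.24): vertical draw under the curve
        while True:                      # (8.25): uniform draw on the slice by rejection
            cand = rng.uniform(lo, hi)
            if f(cand) >= u:
                x = cand
                break
        draws.append(x)
    return draws


draws = slice_sampler(20000, random.Random(0))
mean_draw = sum(draws) / len(draws)
```

Accepted candidates are uniform on the slice because they are uniform on [lo, hi] conditioned on landing in it. The cost is that many candidates are rejected when the slice is narrow, which is one reason more elaborate schemes are used in practice.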


Example 8.4 (Moving between Distant Modes) When the target distribution is 
multimodal, one advantage of a slice sampler becomes more apparent. Figure 8.4 
shows a univariate multimodal target distribution. If a standard Metropolis—Hastings 
algorithm is used to generate samples from the target distribution, then the algorithm 
may find one mode of the distribution, but it may take many iterations to find the other 
mode unless the proposal distribution is very well tuned. Even if it finds both modes, it 
will almost never jump from one to the other. This problem will be exacerbated when 
the number of dimensions increases. In contrast, consider a slice sampler constructed 
to sample from the density shown in Figure 8.4. The horizontal shaded areas indicate 
the set defined in (8.25) from which X^(t+1) | u^(t+1) is uniformly drawn. Hence the slice
sampler will have about a 50% chance of switching modes each iteration. Therefore 
the slice sampler will mix much better with many fewer iterations required. 


FIGURE 8.4 The slice sampler for this multimodal target distribution draws X^(t+1) | u^(t+1)
uniformly from the set indicated by the two horizontal shaded strips.

Slice samplers have been shown to have attractive theoretical properties [467,
543] but can be challenging to implement in practice [481, 543]. The basic slice
sampler described above can be generalized to include multiple auxiliary variables
U_1, ..., U_k and to accommodate multidimensional X [132, 328, 467, 543]. It is also
possible to construct a slice sampler such that the algorithm is guaranteed to sample 
from the stationary distribution of the Markov chain [98, 466]. This is a variant of 
perfect sampling, which is discussed in Section 8.5. 


8.4 OTHER METROPOLIS-HASTINGS ALGORITHMS 


8.4.1 Hit-and-Run Algorithm 


The Metropolis—Hastings algorithm presented in Section 7.1 is time homogeneous in 
the sense that the proposal distribution does not change as t increases. It is possible
to construct MCMC approaches that rely on time-varying proposal distributions,
g^(t)(· | x^(t)). Such methods can be very effective, but their convergence properties are
generally more difficult to ascertain due to the time inhomogeneity [462]. The adaptive 
MCMC algorithms of Section 8.1 are examples of time-inhomogeneous algorithms. 

One such strategy that resembles a random walk chain is known as the hit-and-run
algorithm [105]. In this approach, the proposed move away from x^(t) is generated
in two stages: by choosing a direction to move and then a distance to move in the
chosen direction. After initialization at x^(0), the chain proceeds from t = 0 with the
following steps.


1. Draw a random direction ρ^(t) ~ h(ρ), where h is a density defined over the
surface of the unit p-sphere.

2. Find the set of all real numbers λ for which x^(t) + λρ^(t) is in the state space of
X. Denote this set of signed lengths as Λ^(t).

3. Draw a random signed length λ^(t) | (x^(t), ρ^(t)) ~ g_λ^(t)(λ | x^(t), ρ^(t)), where the
density g_λ^(t)(λ | x^(t), ρ^(t)) = g^(t)(x^(t) + λρ^(t)) is defined over Λ^(t). The proposal
distribution may differ from one iteration to the next only through a dependence
on Λ^(t).

4. For the proposal X* = x^(t) + λ^(t)ρ^(t), compute the Metropolis-Hastings ratio

    R(x^{(t)}, X^*) = \frac{f(X^*) \, g^{(t)}(x^{(t)})}{f(x^{(t)}) \, g^{(t)}(X^*)}.

5. Set

    X^{(t+1)} =
      \begin{cases}
        X^*     & \text{with probability } \min\{R(x^{(t)}, X^*), 1\}, \\
        x^{(t)} & \text{otherwise.}
      \end{cases}

6. Increment t and go to step 1.


The above algorithm is one variant of several general hit-and-run approaches [105]. 

The direction distribution h is frequently taken to be uniform over the surface 
of the unit sphere. In p dimensions, a random variable may be drawn from this 
distribution by sampling a p-dimensional standard normal variable Y ~ N(0, I) and
making the transformation ρ = Y/√(YᵀY).
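
The following sketch puts the steps together for a constrained state space. It assumes, for illustration only, that the state space is the unit disk, that the target is f(x) ∝ exp(3 x_1) on that disk, and that the signed length is drawn uniformly over Λ^(t). With a uniform length proposal the g^(t) terms in the ratio cancel, leaving R = f(X*)/f(x^(t)). All function names are hypothetical.

```python
import math
import random


def log_f(x):
    """Unnormalized log target on the unit disk: f(x) proportional to exp(3 * x[0])."""
    return 3.0 * x[0]


def random_direction(p, rng):
    """Uniform direction on the unit sphere: rho = Y / sqrt(Y'Y) with Y ~ N(0, I)."""
    y = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


def chord(x, rho):
    """Interval of signed lengths lam for which x + lam * rho stays in the unit ball."""
    b = sum(xi * ri for xi, ri in zip(x, rho))
    c = sum(xi * xi for xi in x) - 1.0
    root = math.sqrt(max(b * b - c, 0.0))  # max() guards against rounding error
    return -b - root, -b + root


def hit_and_run(n_iter, rng, p=2):
    x = [0.0] * p
    draws = []
    for _ in range(n_iter):
        rho = random_direction(p, rng)                     # step 1
        lo, hi = chord(x, rho)                             # step 2
        lam = rng.uniform(lo, hi)                          # step 3 (uniform on the chord)
        cand = [xi + lam * ri for xi, ri in zip(x, rho)]   # step 4
        # Uniform length proposal: the g terms cancel, so R = f(cand) / f(x).
        if rng.random() < math.exp(min(0.0, log_f(cand) - log_f(x))):  # step 5
            x = cand
        draws.append(list(x))
    return draws


draws = hit_and_run(10000, random.Random(4))
mean_x1 = sum(d[0] for d in draws) / len(draws)
mean_x2 = sum(d[1] for d in draws) / len(draws)
```

Every proposal lies inside the constraint region by construction, which is the feature that makes the method attractive for sharply constrained spaces.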

The performance of this approach has been compared with that of other simple 
MCMC methods [104]. It has been noted that the hit-and-run algorithm can offer 
particular advantage when the state space of X is sharply constrained [29], thereby 
making it difficult to explore all regions of the space effectively with other methods. 
The choice of h has a strong effect on the performance and convergence rate of the 
algorithm, with the best choice often depending on the shape of f and the geometry 
of the state space (including constraints and the chosen units for the coordinates of 
X) [366]. 


8.4.2 Multiple-Try Metropolis—Hastings Algorithm 


If a Metropolis—Hastings algorithm is not successful in some problem, it is probably 
because the chain is slow to converge or trapped in a local mode of f. To overcome 
such difficulties, it may pay to expand the region of likely proposals characterized by 
g(-|x). However, this strategy often leads to very small Metropolis—Hastings ratios 
and therefore to poor mixing. Liu, Liang, and Wong proposed an alternative strategy 
known as multiple-try Metropolis—Hastings sampling for effectively expanding the 
proposal region to improve performance without impeding mixing [420]. 

The approach is to generate a larger number of candidates, thereby improving
exploration of f near x^(t). One of these proposals is then selected in a manner that
ensures that the chain retains the correct limiting stationary distribution. We still use
a proposal distribution g, along with optional nonnegative weights λ(x^(t), x*), where
the symmetric function λ is discussed further below. To ensure the correct limiting
stationary distribution, it is necessary to require that g(x* | x^(t)) > 0 if and only if
g(x^(t) | x*) > 0, and that λ(x^(t), x*) > 0 whenever g(x* | x^(t)) > 0.

Let x^(0) denote the starting value, and define

    w(u, v) = f(v) \, g(u \mid v) \, \lambda(u, v).    (8.26)

Then, for t = 0, 1, ..., the algorithm proceeds as follows:

1. Sample k proposals X*_1, ..., X*_k i.i.d. from g(· | x^(t)).

2. Randomly select a single proposal X*_j from the set of proposals, with probability
proportional to w(x^(t), X*_j) for j = 1, ..., k.

3. Given X*_j = x*_j, sample k − 1 random variables X**_1, ..., X**_{k−1} i.i.d. from the
proposal density g(· | x*_j). Set X**_k = x^(t).

4. Calculate the generalized Metropolis-Hastings ratio

    R_g = \frac{\sum_{i=1}^{k} w(x^{(t)}, X_i^*)}{\sum_{i=1}^{k} w(X_j^*, X_i^{**})}.    (8.27)

5. Set

    X^{(t+1)} =
      \begin{cases}
        X_j^*   & \text{with probability } \min\{R_g, 1\}, \\
        x^{(t)} & \text{otherwise.}
      \end{cases}    (8.28)

6. Increment t and go to step 1.


It is straightforward to show that this algorithm yields a reversible Markov chain with 
limiting stationary distribution equal to f. The efficiency of this approach depends 
on k, the shape of f, and the spread of g relative to f. It has been suggested that 
an acceptance rate of 40-50% be a target [420]. In practice, using the multiple-try 
Metropolis—Hastings algorithm to select from one of many proposals at each iteration 
can lead to chains with lower serial correlation. This leads to better mixing in the sense 
that larger steps can be made to find other local modes or to promote movement in 
certain advantageous directions when we are unable to encourage such steps through 
other means. 
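
One iteration of the multiple-try algorithm can be sketched as below. The sketch assumes a univariate bimodal target, a Gaussian random walk proposal with standard deviation `S`, and the simplest weighting function λ = 1; these choices and all names are hypothetical. The proposal density is evaluated only up to its normalizing constant, which is the same in every term and therefore cancels both in the selection probabilities and in the ratio (8.27).

```python
import math
import random

S = 4.0  # proposal standard deviation (hypothetical tuning choice)


def f(x):
    """Unnormalized bimodal target: equal mixture of N(-4, 1) and N(4, 1)."""
    return math.exp(-0.5 * (x + 4.0) ** 2) + math.exp(-0.5 * (x - 4.0) ** 2)


def g_density(to, frm):
    """Gaussian random walk proposal density g(to | frm), up to a constant."""
    return math.exp(-0.5 * ((to - frm) / S) ** 2)


def w(u, v):
    """Weight (8.26) with lambda(u, v) = 1, i.e. w(u, v) = f(v) g(u | v)."""
    return f(v) * g_density(u, v)


def mtm_step(x, k, rng):
    proposals = [rng.gauss(x, S) for _ in range(k)]             # step 1
    weights = [w(x, xs) for xs in proposals]
    total = sum(weights)
    if total == 0.0:
        return x  # numerical underflow guard: no usable proposal, so stay put
    x_sel = rng.choices(proposals, weights=weights, k=1)[0]     # step 2
    refs = [rng.gauss(x_sel, S) for _ in range(k - 1)] + [x]    # step 3
    r_g = total / sum(w(x_sel, xr) for xr in refs)              # step 4, ratio (8.27)
    return x_sel if rng.random() < min(1.0, r_g) else x         # step 5


rng = random.Random(2)
x, chain = 0.0, []
for _ in range(10000):
    x = mtm_step(x, 5, rng)
    chain.append(x)
chain_mean = sum(chain) / len(chain)
```

The wide proposal would give a low acceptance rate in an ordinary random walk chain; here the selection step favors the candidates that land in regions of high density.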

The weighting function λ can be used to further encourage certain types of proposals.
The simplest choice is λ(x^(t), x*) = 1. An “orientational-biased” method with
λ(x^(t), x*) = {[g(x* | x^(t)) + g(x^(t) | x*)]/2}^(−1) was suggested in [203]. Another
interesting choice is λ(x^(t), x*) = [g(x* | x^(t)) g(x^(t) | x*)]^(−α), defined on the region
where g(x* | x^(t)) > 0. When α = 1, the weight w(x^(t), x*) corresponds to the importance
weight f(x*)/g(x* | x^(t)) assigned to x* when attempting to sample from f using g
and the importance sampling envelope (see Section 6.4.1).
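
The importance weight interpretation for α = 1 follows by substituting this choice of λ into (8.26):

```latex
\begin{align*}
w(x^{(t)}, x^*)
  &= f(x^*) \, g(x^{(t)} \mid x^*) \, \lambda(x^{(t)}, x^*) \\
  &= f(x^*) \, g(x^{(t)} \mid x^*) \,
     \bigl[ g(x^* \mid x^{(t)}) \, g(x^{(t)} \mid x^*) \bigr]^{-1} \\
  &= \frac{f(x^*)}{g(x^* \mid x^{(t)})} .
\end{align*}
```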


8.4.3 Langevin Metropolis—Hastings Algorithm 


In Section 7.1.2 we discussed random walk chains, a type of Markov chain produced 
by a simple variant of the Metropolis—Hastings algorithm. A more sophisticated ver- 
sion, namely a random walk with drift, can be generated using the proposal 


    X^* = x^{(t)} + d^{(t)} + \sigma \epsilon^{(t)},    (8.29)

where

    d^{(t)} = \frac{\sigma^2}{2} \left. \frac{\partial \log f(x)}{\partial x} \right|_{x = x^{(t)}}    (8.30)

and ε^(t) is a p-dimensional standard normal random variable. The scalar σ is a tuning
parameter whose fixed value is chosen by the user to control the magnitude of proposed
steps. The standard Metropolis-Hastings ratio is used to decide whether to accept this
proposal, using

    g(x^* \mid x^{(t)}) \propto \exp\left\{ -\frac{1}{2\sigma^2}
        \left(x^* - x^{(t)} - d^{(t)}\right)^{T} \left(x^* - x^{(t)} - d^{(t)}\right) \right\}.    (8.31)


Theoretical results indicate that the parameter σ should be selected so that the
acceptance rate for the Metropolis-Hastings ratio computed using (8.31) is
0.574 [548].

The proposal distribution for this method is motivated by a stochastic differ- 
ential equation that produces a diffusion (i.e., a continuous-time stochastic process) 
with f as its stationary distribution [283, 508]. To ensure that the discretization 
of this process given by the discrete-time Markov chain described here shares the 
correct stationary distribution, Besag overlaid the Metropolis—Hastings acceptance 
strategy [37]. 

The requirement to know the gradient of the target (8.30) is not as burdensome as 
it may seem. Any unknown multiplicative constant in f drops out when the derivative 
is taken. Also, when exact derivatives are difficult to obtain, they can be replaced with 
numerical approximations. 
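
A minimal sketch of the Langevin proposal (8.29) to (8.31) with the Metropolis-Hastings correction is shown below. It assumes a target of independent standard normals, for which the gradient of log f is available in closed form; the target, the value of σ, and the function names are hypothetical. Note that the proposal is not symmetric, so both g(x* | x^(t)) and g(x^(t) | x*) enter the ratio.

```python
import math
import random


def log_f(x):
    """Unnormalized log target: independent standard normal components."""
    return -0.5 * sum(v * v for v in x)


def grad_log_f(x):
    """Gradient of log f; for this target it is simply -x."""
    return [-v for v in x]


def drift(x, sigma):
    """Drift (8.30): d = (sigma^2 / 2) times the gradient of log f at x."""
    return [0.5 * sigma ** 2 * gv for gv in grad_log_f(x)]


def log_g(to, frm, sigma):
    """Log proposal density (8.31) for a move frm -> to, up to an additive constant."""
    d = drift(frm, sigma)
    sq = sum((t - s - dv) ** 2 for t, s, dv in zip(to, frm, d))
    return -sq / (2.0 * sigma ** 2)


def langevin_mh(n_iter, sigma, rng, p=2):
    x = [0.0] * p
    draws, accepted = [], 0
    for _ in range(n_iter):
        d = drift(x, sigma)
        cand = [xv + dv + sigma * rng.gauss(0.0, 1.0) for xv, dv in zip(x, d)]  # (8.29)
        log_r = (log_f(cand) + log_g(x, cand, sigma)
                 - log_f(x) - log_g(cand, x, sigma))
        if rng.random() < math.exp(min(0.0, log_r)):
            x = cand
            accepted += 1
        draws.append(list(x))
    return draws, accepted / n_iter


draws, acc_rate = langevin_mh(20000, 1.0, random.Random(3))
mean_1 = sum(d[0] for d in draws) / len(draws)
var_1 = sum((d[0] - mean_1) ** 2 for d in draws) / len(draws)
```

In practice σ would be adjusted in pilot runs, using `acc_rate` as a guide toward the recommended acceptance rate.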

Unlike a random walk, this algorithm introduces a drift that favors pro- 
posals that move toward modes of the target distribution. Ordinary Metropolis— 
Hastings algorithms—including the random walk chain and the independence chain— 
generally are driven by proposals that are made independently of the shape of f, 
thereby being easy to implement but sometimes slow to approach stationarity or ad- 
equately explore the support region of f. When performance of a generic algorithm 
is poor, problem-specific Metropolis—Hastings algorithms are frequently employed 
with specialized proposal distributions crafted in ways that are believed to exploit fea- 
tures of the target. Langevin Metropolis—Hastings algorithms also provide proposal 
distributions motivated by the shape of f, but the self-targeting is done generically 
through the use of the gradient. These methods can provide better exploration of the 
target distribution and faster convergence. 

In some applications, the update given by (8.29) can yield Markov chains that 
fail to approach convergence in runs of reasonable length, and fail to explore more 
than one mode of f. Stramer and Tweedie [618] generalize (8.29) somewhat with 
different drift and scaling terms that yield improved performance. Further study of 
Langevin methods is given in [547, 548, 617, 618].


8.5 PERFECT SAMPLING


MCMC is useful because at the tth iteration it generates a random draw X^(t) whose
distribution approximates the target distribution f as t → ∞. Since run lengths are
finite in practice, much of the discussion in Chapter 7 pertained to assessing when
the approximation becomes sufficiently good. For example, Section 7.3 presents methods
to determine the run length and the number of iterations to discard (i.e., the
burn-in). However, these convergence diagnostics all have various drawbacks. Perfect 
sampling algorithms avoid all these concerns by generating a chain that has exactly 
reached the stationary distribution. This sounds wonderful, but there are challenges in 
implementation. 


8.5.1 Coupling from the Past 


Propp and Wilson introduced a perfect sampling MCMC algorithm called coupling 
from the past (CFTP) [520]. Other expositions of the CFTP algorithm include [96, 
165, 519]. The website maintained by Wilson surveys much of the early literature on 
CFTP and related methods [667]. 

CFTP is often motivated by saying that the chain is started at t = −∞ and run
forward to time t = 0. While this is true, convergence does not suddenly occur in
the step from t = −1 to t = 0, and you are not required to somehow set t = −∞ on
your computer. Instead, we will identify a window of time from t = τ < 0 to t = 0
for which whatever happens before τ is irrelevant, and the infinitely long progression
of the chain prior to τ means that the chain is in its stationary distribution by time 0.

While this strategy might sound reasonable at the outset, in practice it is impossible
to know what state the chain is in at time τ. Therefore, we must consider
multiple chains: in fact, one chain started in every possible state at time τ. Each chain
can be run forward from t = τ to t = 0. Because of the Markov nature of these chains,
the chain outcomes at time τ + 1 depend only on their status at time τ. Therefore,
this collection of chains completely represents every possible chain that could have
been run from infinitely long ago in the past.

The next problem is that we no longer have a single chain, and it seems that 
chain states at time 0 will differ. To remedy this multiplicity, we rely on the idea of 
coupling. Two chains on the same state space with the same transition probabilities 
have coupled (or coalesced) at time t if they share the same state at time t. At this
point, the two chains will have identical probabilistic properties, due to the Markov
property and the equal transition probabilities. A third such chain could couple with
these two at time t or any time thereafter. Thus, to eliminate the multiple chains
introduced above, we use an algorithm that ensures that once chains have coupled, 
they follow the same sample path thereafter. Further, we insist that all chains must 
have coalesced by time 0. This algorithm will therefore yield one chain from t = 0 
onwards which is in the desired stationary distribution. 

To simplify presentation, we assume that X is unidimensional and has finite 
state space with K states. Neither assumption is necessary for CFTP strategies more 
general than the one we describe below.

We consider an ergodic Markov chain with a deterministic transition rule q that
updates the current state of the chain, x^(t), as a function of some random variable
U^(t+1). Thus,

    X^{(t+1)} = q\left(x^{(t)}, U^{(t+1)}\right).    (8.32)

For example, a Metropolis-Hastings proposal from a distribution with cumulative
distribution function F can be generated using q(x, u) = F^(−1)(u), and a random walk
proposal can be generated using q(x, u) = x + u. In (8.32) we used a univariate U^(t+1),
but, more generally, chain transitions may be governed by a multivariate vector U^(t+1).
We adopt the general case hereafter.

CFTP starts one chain from each state in the state space at some time τ < 0
and transitions each chain forward using proposals generated by q. Proposals are
accepted using the standard Metropolis-Hastings ratio. The goal is to find a starting
time τ such that the chains have all coalesced by time t = 0 when run forwards in time
from t = τ. This approach provides X^(0), which is a draw from the desired stationary
distribution f.

The algorithm to find τ and thereby produce the desired chain is as follows.
Let X_k^(t) be the random state at time t of the Markov chain started in state k, with
k = 1, ..., K.


1. Let τ = −1. Generate U^(0). Start a chain in each state of the state space at
time −1, namely x_1^(−1), ..., x_K^(−1), and run each chain forward to time 0 via the
update X_k^(0) = q(x_k^(−1), U^(0)) for k = 1, ..., K. If all K chains are in the same
state at time 0, then the chains have coalesced and X^(0) is a draw from f; the
algorithm stops.

2. If the chains have not coalesced, then let τ = −2. Generate U^(−1). Start a chain
in each state of the state space at time −2, and run each chain forward to
time 0. To do this, let X_k^(−1) = q(x_k^(−2), U^(−1)). Next, you must reuse the U^(0)
generated in step 1, so X_k^(0) = q(X_k^(−1), U^(0)). If all K chains are in the same
state at time 0, then the chains have coalesced and X^(0) is a draw from f; the
algorithm stops.

3. If the chains have not coalesced, move the starting time back to time τ = −3
and update as above. We continue restarting the chains one step further back
in time and running them forward to time 0 until we start at a τ for which all
K chains have coalesced by time t = 0. At this point the algorithm stops. In
every attempt, it is imperative that the random updating variables be reused.
Specifically, when starting the chains at time τ, you must reuse the previously
drawn random number updates U^(τ+1), U^(τ+2), ..., U^(0). Also note that the same
U^(t) vector is used to update all K chains at the tth iteration.


FIGURE 8.5 Example of perfect sampling sample paths, with one panel for each of
iterations 1, 2, and 3. See Example 8.5 for details.

Propp and Wilson show that the value of X^(0) returned from the CFTP algorithm
for a suitable q is a realization of a random variable distributed according to the
stationary distribution of the Markov chain and that this coalescent value will be
produced in finite time [520]. Even if all chains coalesce before time 0, you must use
X^(0) as the perfect sampling draw; otherwise sampling bias is introduced.
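
The CFTP steps can be sketched for a small finite chain as follows. This is an illustration under stated assumptions: the 3-state transition matrix `P` is hypothetical, and the deterministic rule q moves the chain directly by inverting the cumulative distribution of the current row of `P` rather than through a Metropolis-Hastings proposal. The stationary distribution of this `P` is (0.25, 0.5, 0.25), which can be checked by verifying detailed balance.

```python
import random

# Hypothetical 3-state transition matrix with stationary distribution (0.25, 0.5, 0.25).
P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]


def q(x, u):
    """Deterministic update (8.32): invert the CDF of row x of P at u."""
    cum = 0.0
    for j, prob in enumerate(P[x]):
        cum += prob
        if u < cum:
            return j
    return len(P) - 1


def cftp(rng):
    """Return one exact draw X^(0) by coupling from the past."""
    us = []  # us[j] stores U^(-j); once drawn, a value is reused, never regenerated
    while True:
        us.append(rng.random())            # one new update for the earliest time step
        states = list(range(len(P)))       # one chain per state at time tau = -len(us)
        for j in reversed(range(len(us))):
            # The same U^(-j) is shared by every chain at this time step.
            states = [q(x, us[j]) for x in states]
        if len(set(states)) == 1:          # all chains agree at time 0
            return states[0]


rng = random.Random(5)
sample = [cftp(rng) for _ in range(4000)]
freq = [sample.count(s) / len(sample) for s in range(len(P))]
```

Each call to `cftp` starts with a fresh list of updates, so repeated calls give independent draws.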

Obtaining the perfect sampling draw X^(0) from f is not sufficient for most uses.
Typically we desire an i.i.d. n-sample from f, either for simulation or to use in a Monte
Carlo estimate of some expectation, μ = ∫ h(x) f(x) dx. A perfect i.i.d. sample from
f can be obtained by running the CFTP algorithm n times to generate n individual
values for X^(0). If you only want to ensure that the algorithm is, indeed, sampling
from f, but independence is not required, you can run CFTP once and continue to 
run the chain forward from its state at time t = 0. While the first option is probably 
preferable, the second option may be more reasonable in practice, especially for cases 
where the CFTP algorithm requires many iterations before coalescence is achieved. 
These are only the two simplest strategies available for using the output of a perfect 
sampling algorithm; see also [474] and the references in [667]. 


Example 8.5 (Sample Paths in a Small State Space) In the example shown in
Figure 8.5, there are three possible states, s_1, s_2, s_3. At iteration 1, a sample path
is started from each of the three states at time τ = −1. A random update U^(0) is
selected, and X_k^(0) = q(s_k, U^(0)) for k = 1, 2, 3. The paths have not all coalesced at
time t = 0, so the algorithm moves to iteration 2. In iteration 2, the algorithm begins
at time τ = −2. The transition rule for the moves from t = −2 to t = −1 is based on a
newly sampled update variable, U^(−1). The transition rule for the moves from t = −1
to t = 0 relies on the same U^(0) value obtained previously in iteration 1. The paths
have not all coalesced at time t = 0, so the algorithm moves to iteration 3. Here, the
previous draws for U^(0) and U^(−1) are reused and a new U^(−2) is selected. In iteration
3, all three sample paths visit state s_2 at time t = 0, thus the paths have coalesced,
and X^(0) = s_2 is a draw from the stationary distribution f.


Several finer details of CFTP implementation merit mention. First, note that
CFTP requires reuse of previously generated variables U^(t) and the shared use of the
same U^(t) realization to update all chains at time t. If the U^(t) are not reused, the
samples will be biased. Propp and Wilson show an example where the regeneration
of the U^(t) at each time biases the chain toward the extreme states in an ordered state
space [520]. The reuse and sharing of past U^(t) ensures that all chains coalesce by
t = 0 when started at any time τ' < τ, where τ is the starting time chosen by CFTP.
Moreover, this practice ensures that the coalescent state at time 0 is the same for all
such chains in a given run, which enables the proof that CFTP produces an exact draw
from f.

Second, CFTP introduces a dependence between the τ and X^(0) it chooses.
Therefore, bias can be induced if a CFTP run is stopped prematurely before the 
coupling time has been determined. Suppose a CFTP algorithm is run for a long 
time during which coupling does not occur. If the computer crashes or an impa- 
tient user stops and then restarts the algorithm to find a coupling time, this will 
generally bias the sample toward those states in which coupling is easier. An alter- 
native perfect sampling method known as Fill’s algorithm was designed to avoid this 
problem [193]. 

Third, our description of the CFTP algorithm uses the sequence of starting times
τ = −1, −2, ... for successive CFTP iterations. For many problems, this will be
inefficient. It may be more efficient to use the sequence τ = −1, −2, −4, −8, −16, ...,
which minimizes the worst case number of simulation steps required and nearly
minimizes the expected number of required steps [520].
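
A rough step count suggests why the doubling sequence helps. Suppose, as an illustrative assumption, that coalescence by time 0 first occurs when the chains are started at time −T, and count one simulation step for each time step a chain is run forward. Because a start at any earlier time also coalesces when the updates are reused, the doubling sequence stops at the first start time −2^m with 2^m ≥ T. The totals per chain are then

```latex
\[
\underbrace{1 + 2 + \cdots + T}_{\tau = -1, -2, \ldots, -T} = \frac{T(T+1)}{2},
\qquad
\underbrace{1 + 2 + 4 + \cdots + 2^{m}}_{\tau = -1, -2, -4, \ldots, -2^{m}}
  = 2^{m+1} - 1 < 4T ,
\]
```

where the final inequality uses 2^(m−1) < T. The one-step-back sequence therefore costs on the order of T² steps, while the doubling sequence costs on the order of T.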

Finally, it may seem that this coupling strategy should work if the chains were 
run forwards from time t = 0 instead of backwards; but that is not the case. To under- 
stand why, consider a Markov chain for which some state x’ has a unique predecessor. 
It is impossible for x’ to occur at the random time of first coalescence. If x’ occurred, 
the chain must have already coalesced at the previous time, since all chains must 
have been in the predecessor state. Therefore the marginal distribution of the chain 
at the first coalescence time must assign zero probability to x’ and hence cannot be 
the stationary distribution. Although this forward coupling approach fails, there is a 
clever way to modify the CFTP construct to produce a perfect sampling algorithm for 
a Markov chain that only runs forward in time [666]. 


8.5.1.1 Stochastic Monotonicity and Sandwiching When applying CFTP to
a chain with a vast finite state space or an infinite (e.g., continuous) state space, it can be
challenging to monitor whether sample paths started from all possible elements in the
state space have coalesced by time zero. However, if the state space can be ordered
in some way such that the deterministic transition rule q preserves the state space
ordering, then only the sample paths started from the minimum state and maximum
state in the ordering need to be monitored.

Let $x, y \in S$ denote any two possible states of a Markov chain exploring a
possibly huge state space S. Formally, S is said to admit the natural componentwise
partial ordering, $x \le y$, if $x_i \le y_i$ for $i = 1, \ldots, n$ and $x, y \in S$. The transition rule
q is monotone with respect to this partial ordering if $q(x, u) \le q(y, u)$ for all u
when $x \le y$. Now, if there exist a minimum and a maximum element of the state
space S, so $x_{\min} \le x \le x_{\max}$ for all $x \in S$, and the transition rule q is monotone, then
an MCMC procedure that uses this q preserves the ordering of the states at each
time step. Therefore, CFTP using a monotone transition rule can be carried out by
simulating only two chains: one started at $x_{\min}$ and the other at $x_{\max}$. Sample paths
for chains started at all other states will be sandwiched between the paths started in
the maximum and minimum states. When the sample paths started in the minimum
and maximum states have coalesced at time zero, coalescence of all the intermediate
chains is also guaranteed. Therefore, CFTP samples from the stationary distribution
at t = 0. Many problems satisfy these monotonicity properties; one example is given
in Section 8.7.3.
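
The sandwiching idea can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the chain (a reflecting random walk on {0, ..., m}), the function names `q` and `monotone_cftp`, and the parameters `m` and `p` are hypothetical choices made here, not taken from the text. The rule `q` is monotone in its first argument, so only the paths started at 0 and at m are tracked. The uniforms are stored and reused when the starting time is pushed further into the past, and the starting times follow the doubling sequence -1, -2, -4, ....

```python
import random

def q(x, u, m, p):
    """Monotone rule: step up if u < p, otherwise step down, clamped to {0, ..., m}."""
    return min(x + 1, m) if u < p else max(x - 1, 0)

def monotone_cftp(m=5, p=0.5, rng=None):
    """Return (X0, tau): a draw at time 0 and the (negative) starting time used."""
    rng = rng or random.Random()
    us = []   # us[k] drives the move from time -(k+1) to time -k
    tau = 1
    while True:
        # Extend the uniforms further into the past; earlier ones are reused as-is.
        while len(us) < tau:
            us.append(rng.random())
        lo, hi = 0, m                       # minimum and maximum states
        for k in range(tau - 1, -1, -1):    # run from time -tau up to time 0
            lo, hi = q(lo, us[k], m, p), q(hi, us[k], m, p)
        if lo == hi:                        # all sandwiched paths have coalesced
            return lo, -tau
        tau *= 2                            # starting times -1, -2, -4, -8, ...
```

With p = 0.5 the transition matrix of this walk is symmetric, so its stationary distribution is uniform on {0, ..., m}; repeated calls such as `monotone_cftp(5, 0.5, random.Random(0))` should therefore produce roughly equal counts for each state.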

In problems where this form of monotonicity isn't possible, other related strategies
can be devised [468, 473, 666]. Considerable effort has been focused on developing
methods to apply perfect sampling for specific problem classes, such as perfect
Metropolis-Hastings independence chains [123], perfect slice samplers [466], and
perfect sampling algorithms for Bayesian model selection [337, 575].

Perfect sampling is currently an area of active research, and many extensions 
of the ideas presented here have been proposed. While this idea is quite promising, 
perfect sampling algorithms have not been widely implemented for problems of real- 
istic size. Challenges in implementation and long coalescence times have sometimes 
discouraged large-scale realistic applications. Nonetheless, the attractive properties 
of perfect sampling algorithms and continued research in this area will likely motivate 
new and innovative MCMC algorithms for practical problems. 


8.6 MARKOV CHAIN MAXIMUM LIKELIHOOD 


We have presented Markov chain Monte Carlo in the context of Monte Carlo integra- 
tion, with many Bayesian examples. However, MCMC techniques can also be useful 
for maximum likelihood estimation, particularly in exponential families [234, 505]. 
Consider data generated from an exponential family model $X \sim f(\cdot \mid \theta)$ where

\[ f(x \mid \theta) = c_1(x)\, c_2(\theta) \exp\{\theta^T s(x)\}. \tag{8.33} \]

Here $\theta = (\theta_1, \ldots, \theta_p)$ and $s(x) = (s_1(x), \ldots, s_p(x))$ are vectors of canonical
parameters and sufficient statistics, respectively. For many problems, $c_2(\theta)$ cannot be
determined analytically, so the likelihood cannot be directly maximized.

Suppose that we generate $X^{(1)}, \ldots, X^{(n)}$ from an MCMC approach having
$f(\cdot \mid \psi)$ as the stationary density, where $\psi$ is any particular choice for $\theta$ and $f(\cdot \mid \psi)$ is
in the same exponential family as the data density. Because
$c_2(\theta)^{-1} = \int c_1(x) \exp\{\theta^T s(x)\}\, dx$, multiplying and dividing the integrand by
$c_2(\psi) \exp\{\psi^T s(x)\}$ shows that

\[ c_2(\theta)^{-1} = c_2(\psi)^{-1} \int \exp\{(\theta - \psi)^T s(x)\}\, f(x \mid \psi)\, dx. \tag{8.34} \]

Although the MCMC draws are dependent and not exactly from $f(\cdot \mid \psi)$,

\[ \hat{k}(\theta) = \frac{1}{n} \sum_{t=1}^{n} \exp\{(\theta - \psi)^T s(X^{(t)})\} \to \frac{c_2(\psi)}{c_2(\theta)} \tag{8.35} \]

as $n \to \infty$ by the strong law of large numbers (1.46). Therefore, a Monte Carlo
estimator of the log likelihood given data x is

\[ \hat{l}(\theta \mid x) = \theta^T s(x) - \log \hat{k}(\theta), \tag{8.36} \]

up to an additive constant. The maximizer of $\hat{l}(\theta \mid x)$ converges to the maximizer of
the true log likelihood as $n \to \infty$. Therefore, we take the Monte Carlo maximum
likelihood estimate of $\theta$ to be the maximizer of (8.36), which we denote $\hat{\theta}_\psi$.
Hence we can approximate the MLE $\hat{\theta}$ using simulations from $f(\cdot \mid \psi)$ generated
via MCMC. Of course, the quality of $\hat{\theta}_\psi$ will depend greatly on $\psi$. Analogously to
importance sampling, $\psi = \hat{\theta}$ is best. In practice, however, we must choose one or
more values wisely, perhaps through adaptation or empirical estimation [234].
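
To make the estimator in (8.35) and (8.36) concrete, here is a minimal Python sketch for a toy one-parameter exponential family. Everything specific in it is an assumption made for illustration: the state space {0, ..., m}, the choice s(x) = x with $c_1(x) = 1$, the random-walk Metropolis sampler, the grid search, and the function names `metropolis_draws`, `mc_loglik`, and `mcmle`. For this toy model the normalizing constant is actually available in closed form, which is what allows the result to be checked.

```python
import numpy as np

def metropolis_draws(psi, m, n, rng):
    """Random-walk Metropolis on {0, ..., m} targeting f(x | psi) proportional to exp(psi * x)."""
    x = m // 2
    out = np.empty(n)
    for t in range(n):
        y = x + (1 if rng.random() < 0.5 else -1)
        # Proposals outside the state space are rejected, so the chain stays put.
        if 0 <= y <= m and rng.random() < np.exp(psi * (y - x)):
            x = y
        out[t] = x
    return out

def mc_loglik(theta, s_obs, s_draws, psi):
    """Monte Carlo log likelihood (8.36) for scalar theta, up to an additive constant."""
    k_hat = np.mean(np.exp((theta - psi) * s_draws))   # estimator (8.35)
    return theta * s_obs - np.log(k_hat)

def mcmle(s_obs, s_draws, psi, grid):
    """Maximize the estimated log likelihood over a grid of candidate theta values."""
    values = [mc_loglik(th, s_obs, s_draws, psi) for th in grid]
    return grid[int(np.argmax(values))]
```

A possible usage: draw `draws = metropolis_draws(0.0, 10, 50000, np.random.default_rng(1))`, then call `mcmle(7, draws, 0.0, np.linspace(-1, 1, 401))` for an observed value x = 7. A grid search is used only to keep the sketch short; any numerical optimizer could take its place.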


8.7 EXAMPLE: MCMC FOR MARKOV RANDOM FIELDS 


We offer here an introduction to Bayesian analysis of Markov random field models 
with emphasis on the analysis of spatial or image data. This topic provides interesting 
examples of many of the methods discussed in this chapter. 

A Markov random field specifies a probability distribution for spatially refer- 
enced random variables. Markov random fields are quite general and can be used 
for many lattice-type structures such as regular rectangular, hexagonal, and irregular 
grid structures [128, 635]. There are a number of complex issues with Markov ran- 
dom field construction that we do not attempt to resolve here. Besag has published a
number of key papers on Markov random fields for spatial statistics and image
analysis, including his seminal 1974 paper [35, 36, 40-43]. Additional comprehensive
coverage of Markov random fields is given in [128, 377, 412, 668].

For simplicity, we focus here on the application of Markov random fields to
a regular rectangular lattice. For example, we might overlay a rectangular grid on a
map or image and label each pixel or cell in the lattice. The value for the ith pixel
in the lattice is denoted by $x_i$ for $i = 1, \ldots, n$, where n is finite. We will focus on
binary random fields where $x_i$ can take on only two values, 0 and 1, for $i = 1, \ldots, n$.
It is generally straightforward to extend methods to the case where $x_i$ is continuous
or takes on more than two discrete values [128].

Let $x_{\delta_i}$ define the set of x values for the pixels that are near pixel i. The pixels
that define $\delta_i$ are called the neighborhood of pixel i. The pixel $x_i$ is not in $\delta_i$. A proper
neighborhood definition must meet the condition that if pixel i is a neighbor of pixel j
then pixel j is a neighbor of pixel i. In a rectangular lattice, a first-order neighborhood
is the set of pixels that are vertically and horizontally adjacent to the pixel of interest
(see Figure 8.6). A second-order neighborhood also includes the pixels diagonally
adjacent to the pixel of interest.
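
The two neighborhood definitions can be written down directly. The short Python sketch below is illustrative only; the function name `neighbors`, the (row, column) indexing, and the choice to simply truncate neighborhoods at the lattice boundary are assumptions made here rather than conventions stated in the text.

```python
def neighbors(r, c, nrow, ncol, order=1):
    """Return the lattice neighbors of pixel (r, c) for a first- or second-order neighborhood."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # vertical and horizontal
    if order == 2:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # add the diagonals
    # Pixels on the edge of the lattice simply have fewer neighbors (assumed convention).
    return [(r + dr, c + dc) for dr, dc in offsets
            if 0 <= r + dr < nrow and 0 <= c + dc < ncol]
```

Because the offsets are symmetric, the required condition holds automatically: whenever pixel j appears in `neighbors` of pixel i, pixel i appears in `neighbors` of pixel j. The pixel itself is never included.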

Imagine that the value $x_i$ for the ith pixel is a realization of a random variable $X_i$.
A locally dependent Markov random field specifies that the distribution of $X_i$ given
the remaining pixels, $X_{-i}$, is dependent only on the neighboring pixels. Therefore,
for $X_{-i} = x_{-i}$,

\[ f(x_i \mid x_{-i}) = f(x_i \mid x_{\delta_i}) \tag{8.37} \]

for $i = 1, \ldots, n$. Assuming each pixel has a nonzero probability of equaling 0 or 1
means that the so-called positivity condition is satisfied: that the minimal state space
of X equals the Cartesian product of the state spaces of its components. The positivity
condition ensures that the conditional distributions considered later in this section are
well defined.

FIGURE 8.6 Shaded pixels indicate a first-order and a second-order neighborhood of pixel
i for a rectangular lattice. Panel titles: First Order, Second Order.


The Hammersley-Clifford theorem shows that the conditional distributions in
(8.37) together specify the joint distribution of X up to a normalizing constant [35]. For
our discrete binary state space, this normalizing constant is the sum of $f(x)$ over all
x in the state space. This sum is not usually available by direct calculation, because
the number of terms is enormous. Even for an unrealistically small 40 × 40 pixel
image where the pixels take on binary values, there are $2^{1600} \approx 4.4 \times 10^{481}$ terms in
the summation (since $1600 \log_{10} 2 \approx 481.65$). Bayesian MCMC methods provide a
Monte Carlo basis for inference about images, despite such difficulties. We describe
below several approaches for MCMC analysis of Markov random field models.


8.7.1 Gibbs Sampling for Markov Random Fields 


We begin by adopting a Bayesian model for analysis of a binary Markov random
field. In the introduction to Markov random fields above, we used $x_i$ to denote the
value of the ith pixel. Here we let $X_i$ denote the unknown true value of the ith pixel,
where $X_i$ is treated as a random variable in the Bayesian paradigm. Let $y_i$ denote the
observed value for the ith pixel. Thus X is a parameter vector and y is the data. In an
image analysis application, y is the degraded image and X is the unknown true image.
In a spatial statistics application of mapping of plant or animal species distributions,
$y_i = 0$ might indicate that the species was not observed in pixel i during the sampling
period and $X_i$ might denote the true (unobserved) presence or absence of the species
in pixel i.

Three assumptions are fundamental to the formulation of this model. First, we
assume that observations are mutually independent given true pixel values. So the
joint conditional density of Y given X = x is

\[ f(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(y_i \mid x_i), \tag{8.38} \]

where $f(y_i \mid x_i)$ is the density of the observed data in pixel i given the true value. Thus,
viewed as a function of x, (8.38) is the likelihood function. Second, we adopt a locally
dependent Markov random field (8.37) to model the true image. Finally, we assume
the positivity condition, defined above.

The parameters of the model are $x_1, \ldots, x_n$, and the goal of the analysis is to
estimate these true values. To do this we adopt a Gibbs sampling approach. Assume
the prior $X \sim f(x)$ for the parameters. The goal in the Gibbs sampler, then, is to
obtain a sample from the posterior density of X,

\[ f(x \mid y) \propto f(y \mid x)\, f(x). \tag{8.39} \]

One class of prior densities for X is given by

\[ f(x) \propto \exp\Big\{ -\sum_{i \sim j} \phi(x_i - x_j) \Big\}, \tag{8.40} \]

where $i \sim j$ denotes all pairs such that pixel i is a neighbor of pixel j, and $\phi$ is some
function that is symmetric about 0 with $\phi(z)$ increasing as $|z|$ increases. Equation
(8.40) is called a pairwise difference prior. Adopting this prior distribution based on
pairwise interactions simplifies computations but may not be realistic. Extensions to
allow for higher-order interactions have been proposed [635].

The Gibbs sampler requires the derivation of the univariate conditional
distributions whose form follows from (8.37) to (8.39). The Gibbs update at iteration t is
therefore

\[ X_i^{(t+1)} \mid \big(x_{-i}^{(t)}, y\big) \sim f\big(x_i \mid x_{-i}^{(t)}, y\big). \tag{8.41} \]

A common strategy is to update each $X_i$ in turn, but it can be more computationally
efficient to update the pixels in independent blocks. The blocks are determined by
the neighborhoods defined for a particular problem [40]. Other approaches to block
updating for Markov random field models are given in [382, 563].


Example 8.6 (Utah Serviceberry Distribution Map) An important problem in 
ecology is the mapping of species distributions over a landscape [286, 584]. These 
maps have a variety of uses, ranging from local land-use planning aimed at minimizing 
human development impacts on rare species to worldwide climate modeling. Here we 
consider the distribution of a deciduous small tree or shrub called the Utah serviceberry 
Amelanchier utahensis in the state of Colorado [414]. 

We consider only the westernmost region of Colorado (west of approximately
104°W longitude), a region that includes the Rocky Mountains. We binned the
presence-absence information into pixels that are approximately 8 × 8 km. This
grid consists of a lattice of 46 × 54 pixels, giving a total of n = 2484 pixels. The left
panel in Figure 8.7 shows presence and absence, where each black pixel indicates
that the species was observed in that pixel.

In typical applications of this model, the true image is not available. However,
knowing the true image allows us to investigate various aspects of modeling binary
spatially referenced data in what follows. Therefore, for purposes of illustration, we
will use these pixelwise presence-absence data as a true image and consider estimating
this truth from a degraded version of the image. A degraded image is shown in the
right panel of Figure 8.7. We seek a map that reconstructs the true distribution of
this species using this degraded image, which is treated as the observed data y. The
observed data were generated by randomly selecting 30% of the pixels and switching
their colors. Such errors might arise in satellite images or other error-prone approaches
to species mapping.

FIGURE 8.7 Distribution of Utah serviceberry in western Colorado. The left panel is the
true species distribution, and the right panel is the observed species distribution used in
Example 8.6. Black pixels indicate presence.

Let $x_i = 1$ indicate that the species is truly present in pixel i. In a species
mapping problem such as this one, such simple coding may not be completely appropriate.
For example, a species may be present only in a portion of pixel i, or several sites
may be included in one pixel, and thus we might consider modeling the proportion
of sites in each pixel where the species was observed to be present. For simplicity,
we assume that this application of Markov random fields is more akin to an image
analysis problem where $x_i = 1$ indicates that the pixel is black.

We consider the simple likelihood function arising from the data density

\[ f(y \mid x) \propto \exp\Big\{ \alpha \sum_{i=1}^{n} 1_{\{y_i = x_i\}} \Big\} \tag{8.42} \]

for $x_i \in \{0, 1\}$. The parameter $\alpha$ can be specified as a user-selected constant or
estimated by adopting a prior for it. We adopt the former approach here, setting $\alpha = 1$.
We assume the pairwise difference prior density for X given by

\[ f(x) \propto \exp\Big\{ \beta \sum_{i \sim j} 1_{\{x_i = x_j\}} \Big\} \tag{8.43} \]

for $x \in S = \{0, 1\}^{46 \times 54}$. We consider a first-order neighborhood, so summation over
$i \sim j$ in (8.43) indicates summation over the horizontally and vertically adjacent
pixels of pixel i, for $i = 1, \ldots, n$. Equation (8.43) introduces the hyperparameter
$\beta$, which can be assigned a hyperprior or specified as a constant. Usually $\beta$ is
restricted to be positive to encourage clustering of similar-colored pixels. Here we set
$\beta = 0.8$. Sensitivity analysis to determine the effect of chosen values for $\alpha$ and $\beta$ is
recommended.

FIGURE 8.8 Estimated posterior mean of X for the Gibbs sampler analysis in Example 8.6.

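Assuming (8.42) and (8.43), the univariate conditional distribution for $X_i \mid x_{-i}, y$
is Bernoulli. Thus during the (t + 1)th cycle of the Gibbs sampler, the ith pixel is set
to 1 with probability

\[ P\big[X_i^{(t+1)} = 1 \mid x_{-i}^{(t)}, y\big] = \Big( 1 + \exp\Big\{ \alpha \big(1_{\{y_i = 0\}} - 1_{\{y_i = 1\}}\big) + \beta \sum_{i \sim j} \big(1_{\{x_j^{(t)} = 0\}} - 1_{\{x_j^{(t)} = 1\}}\big) \Big\} \Big)^{-1} \tag{8.44} \]

for $i = 1, \ldots, n$. Recall that

\[ x_{-i}^{(t)} = \big(x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)}\big), \]

so neighbors are always assigned their most recent values as soon as they become
available within the Gibbs cycle.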
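
One Gibbs cycle based on (8.44) can be sketched as follows. This is a minimal illustration, not the implementation used for the serviceberry analysis: the function name `gibbs_sweep`, the raster-scan update order, the NumPy array representation of the image, and the truncation of neighborhoods at the image boundary are all assumptions made here.

```python
import numpy as np

def gibbs_sweep(x, y, alpha, beta, rng):
    """One in-place raster-scan Gibbs cycle over a binary image using (8.44)."""
    nrow, ncol = x.shape
    for r in range(nrow):
        for c in range(ncol):
            diff = 0   # sum over first-order neighbors of 1{x_j = 0} - 1{x_j = 1}
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    diff += 1 - 2 * int(x[rr, cc])   # uses the most recent values
            lik = alpha * (1 - 2 * int(y[r, c]))     # alpha * (1{y_i = 0} - 1{y_i = 1})
            p1 = 1.0 / (1.0 + np.exp(lik + beta * diff))
            x[r, c] = 1 if rng.random() < p1 else 0
    return x
```

Calling `gibbs_sweep` repeatedly and averaging the resulting images pixel by pixel (after any burn-in) gives an estimate of the posterior probability that each pixel is black. Because the image is modified in place, each pixel update automatically sees the newest values of its neighbors.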

Figure 8.8 gives the posterior mean probability of presence for the Utah ser- 
viceberry in western Colorado as estimated using the Gibbs sampler described above. 
Figure 8.9 shows that the mean posterior estimates from the Gibbs sampler success- 
fully discriminate between true presence and absence. Indeed, if pixels with posterior 
mean of 0.5 or larger are converted to black and pixels with posterior mean smaller 
than 0.5 are converted to white, then 86% of the pixels are labeled correctly. 


The model used in Example 8.6 is elementary, ignoring many of the important
issues that may arise in the analysis of such spatial lattice data. For example, when
the pixels are created by binning spatially referenced data, it is unclear how to code
the observed response for pixel i if the species was observed to be present in some
portions of it and not in other portions.

FIGURE 8.9 Boxplots of posterior mean estimates of $P[X_i = 1]$ for Example 8.6. Averaging
pixel-specific sample paths from the Gibbs sampler provides an estimate of $P[X_i = 1]$ for each
i. The boxplots show these estimates split into two groups corresponding to pixels where the
serviceberry was truly present and pixels where it wasn't.


A model that addresses this problem uses a latent binary spatial process over
the region of interest [128, 217]. Let $\lambda(s)$ denote a binary process over the image
region, where s denotes coordinates. Then the proportion of pixel i that is occupied
by the species of interest is given by

\[ p_i = \frac{1}{|A_i|} \int_{s \text{ within pixel } i} 1_{\{\lambda(s) = 1\}}\, ds, \tag{8.45} \]

where $|A_i|$ denotes the area of pixel i. The $Y_i \mid x_i$ are assumed to be conditionally
independent Bernoulli trials with probability of detecting presence given by $p_i$, so
$P[Y_i = 1 \mid X_i = 1] = p_i$. This formalization allows for direct modeling when pixels
may contain a number of sites that were sampled. A more complex form of this model
is described in [217]. We may also wish to incorporate covariate data to improve our
estimates of species distributions. For example, the Bernoulli trials may be modeled
as having parameters $p_i$ for which

\[ \log\Big\{ \frac{p_i}{1 - p_i} \Big\} = w_i^T \beta + \gamma_i, \tag{8.46} \]

where $w_i$ is a vector of covariates for the ith pixel, $\beta$ is the vector of coefficients
associated with the covariates, and $\gamma_i$ is a spatially dependent random effect. These
models are popular in the field of spatial epidemiology; see [44, 45, 411, 503].


8.7.2 Auxiliary Variable Methods for Markov Random Fields 


Although convenient, the Gibbs sampler implemented as described in Section 8.7.1
can have poor convergence properties. In Section 8.3 we introduced the idea of
incorporating auxiliary variables to improve convergence and mixing of Markov
chain algorithms. For binary Markov random field models, the improvement can be
profound.

FIGURE 8.10 Illustration of the Swendsen-Wang algorithm.

One notable auxiliary variable technique is the Swendsen-Wang algorithm
[174, 621]. As applied to binary Markov random fields, this approach creates a
coarser version of an image by clustering neighboring pixels that are colored
similarly. Each cluster is then updated with an appropriate Metropolis-Hastings step. This
coarsening of the image allows for faster exploration of the parameter space in some
applications [328].

In the Swendsen-Wang algorithm, clusters are created via the introduction of
bond variables, $U_{ij}$, for each adjacency $i \sim j$ in the image. Clusters consist of bonded
pixels. Adjacent like-colored pixels may or may not be bonded, depending on $U_{ij}$.
Let $U_{ij} = 1$ indicate that pixels i and j are bonded, and $U_{ij} = 0$ indicate that they
are not bonded. The bond variables $U_{ij}$ are assumed to be conditionally independent
given X = x. Let U denote the vector of all the $U_{ij}$.

Loosely speaking, the Swendsen-Wang algorithm alternates between growing
clusters and coloring them. Figure 8.10 shows one cycle of the algorithm applied to
a 4 × 4 pixel image. The left panel in Figure 8.10 shows the current image and the
set of all possible bonds for a 4 × 4 image. The middle panel shows the bonds that
were generated at the start of the next iteration of the Swendsen-Wang algorithm. We
will see below that bonds between like-colored pixels are generated with probability
$1 - \exp\{-\beta\}$, so like-colored neighbors are not forced to be bonded. Connected sets
of bonded pixels form clusters. We've drawn boxes around the five clusters in the
middle panel of Figure 8.10. This shows the coarsening of the image allowed by the
Swendsen-Wang algorithm. At the end of each iteration, the color of each cluster is
updated: Clusters are randomly recolored in a way that depends upon the posterior
distribution for the image. The updating produces the new image in the right panel in
Figure 8.10. The observed data y are not shown here.

Formally, the Swendsen-Wang algorithm is a special case of a Gibbs sampler
that alternates between updates to X|u and U|x. It proceeds as follows:

1. Draw independent bond variables

\[ U_{ij}^{(t+1)} \mid x^{(t)} \sim \mathrm{Unif}\Big(0, \exp\big\{\beta\, 1_{\{x_i^{(t)} = x_j^{(t)}\}}\big\}\Big) \]

for all $i \sim j$ adjacencies. Note that $U_{ij}^{(t+1)}$ can exceed 1 only if $x_i^{(t)} = x_j^{(t)}$, and
in this case $U_{ij}^{(t+1)} > 1$ with probability $1 - \exp\{-\beta\}$. When $U_{ij}^{(t+1)} > 1$, we
declare the ith and jth pixels to be bonded for iteration t + 1.

2. Sample $X^{(t+1)} \mid u^{(t+1)} \sim f(\cdot \mid u^{(t+1)})$, where

\[ f\big(x \mid u^{(t+1)}\big) \propto \exp\Big\{ \alpha \sum_{i=1}^{n} 1_{\{y_i = x_i\}} \Big\} \times \prod_{i \sim j} 1_{\big\{0 \le u_{ij}^{(t+1)} \le \exp\{\beta 1_{\{x_i = x_j\}}\}\big\}}. \tag{8.47} \]

Note that (8.47) forces the color of each cluster to be updated as a single unit.

3. Increment t and return to step 1.


Thus for our simple model, pixel pairs with the same color are bonded with probability
$1 - \exp\{-\beta\}$. The bond variables define clusters of pixels, with each cluster consisting
of a set of pixels that are interlinked with at least one bond. Each cluster is updated
independently with all pixels in the cluster taking on the same color. Updates in (8.47)
are implemented by simulating from a Bernoulli distribution where the probability of
coloring a cluster of pixels, C, black is

\[ \frac{\exp\big\{ \alpha \sum_{i \in C} 1_{\{y_i = 1\}} \big\}}{\exp\big\{ \alpha \sum_{i \in C} 1_{\{y_i = 0\}} \big\} + \exp\big\{ \alpha \sum_{i \in C} 1_{\{y_i = 1\}} \big\}}. \tag{8.48} \]

The local dependence structure of the Markov random field is decoupled from the
coloring decision given in (8.48), thereby potentially enabling faster mixing.
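
The two steps of one Swendsen-Wang iteration (bonding, then recoloring whole clusters with the probability in (8.48)) can be sketched in Python as follows. The code is an illustrative sketch under assumptions made here: first-order neighborhoods, a NumPy array image, and a small union-find structure (the helper `find`) to collect bonded pixels into clusters. The names `find` and `swendsen_wang_step` are hypothetical.

```python
import numpy as np

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def swendsen_wang_step(x, y, alpha, beta, rng):
    """One Swendsen-Wang iteration for a binary image; returns a new image."""
    nrow, ncol = x.shape
    n = nrow * ncol
    parent = list(range(n))
    p_bond = 1.0 - np.exp(-beta)
    # Step 1: bond like-colored first-order neighbors with probability 1 - exp(-beta).
    for r in range(nrow):
        for c in range(ncol):
            for rr, cc in ((r + 1, c), (r, c + 1)):   # each adjacency visited once
                if rr < nrow and cc < ncol and x[r, c] == x[rr, cc]:
                    if rng.random() < p_bond:
                        a = find(parent, r * ncol + c)
                        b = find(parent, rr * ncol + cc)
                        parent[a] = b
    # Step 2: recolor each cluster as a single unit using the probability in (8.48).
    roots = np.array([find(parent, i) for i in range(n)])
    yf = y.ravel()
    new = np.empty(n, dtype=int)
    for root in np.unique(roots):
        members = roots == root
        n1 = int(np.sum(yf[members] == 1))    # pixels in the cluster observed black
        n0 = int(np.sum(members)) - n1        # pixels in the cluster observed white
        p_black = 1.0 / (1.0 + np.exp(alpha * (n0 - n1)))
        new[members] = 1 if rng.random() < p_black else 0
    return new.reshape(nrow, ncol)
```

The expression for `p_black` is (8.48) with numerator and denominator divided by the numerator. Setting `alpha = 0` recolors every cluster black or white with probability 1/2.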


Example 8.7 (Utah Serviceberry Distributions, Continued) To compare the
performance of the Gibbs sampler and the Swendsen-Wang algorithm, we return
to Example 8.6. For this problem the likelihood has a dominant influence on the
posterior. Thus to highlight the differences between the algorithms, we set $\alpha = 0$ to
understand what sort of mixing can be enabled by the Swendsen-Wang algorithm.
In Figure 8.11, both algorithms were started in the same image in iteration 1, and
the three subsequent iterations are shown. The Swendsen-Wang algorithm produces
images that vary greatly over iterations, while the Gibbs sampler produces images
that are quite similar. In the Swendsen-Wang iterations, large clusters of pixels switch
colors abruptly, thereby providing faster mixing.

When the likelihood is included, there are fewer advantages to the Swendsen-Wang
algorithm when analyzing the data from Example 8.6. For the chosen $\alpha$ and $\beta$,
clusters grow large and change less frequently than in Figure 8.11. In this application,
sequential images from a Swendsen-Wang algorithm look quite similar to those for a
Gibbs sampler, and the differences between results produced by the Swendsen-Wang
algorithm and the Gibbs sampler are small.


Exploiting a property called decoupling, the Swendsen-Wang algorithm grows
clusters without regard to the likelihood, conditional on $X^{(t)}$. The likelihood and
image prior terms are separated in steps 1 and 2 of the algorithm. This feature is
appealing because it can improve mixing rates in MCMC algorithms. Unless $\alpha$ and $\beta$
are carefully chosen, however, decoupling may not be helpful. If clusters tend to grow
large but change color very infrequently, the sample path will consist of rare drastic
image changes. This constitutes poor mixing. Further, when the posterior distribution
is highly multimodal, both the Gibbs sampler and the Swendsen-Wang algorithm can
miss potential modes if the chain is not run long enough. A partial decoupling method
has been proposed to address such problems, and offers some potential advantages
for challenging imaging problems [327, 328].

FIGURE 8.11 Comparison between Gibbs sampling and the Swendsen-Wang algorithm
simulating a Markov random field. Iteration 1 is the same for both algorithms. See Example 8.7
for details. Panel titles: Iteration 1, Iteration 2, Iteration 3, Iteration 4.


8.7.3 Perfect Sampling for Markov Random Fields 


Implementing standard perfect sampling for a binary image problem would require
monitoring sample paths that start from all possible images. This is clearly impossible
even in a binary image problem of moderate size. In Section 8.5.1.1 we introduced the
idea of stochastic monotonicity to cope with large state spaces. We can apply this
strategy to implement perfect sampling for the Bayesian analysis of Markov random fields.

To exploit the stochastic monotonicity strategy, the states must be partially
ordered, so $x \le y$ if $x_i \le y_i$ for $i = 1, \ldots, n$ and for $x, y \in S$. In the binary image
problem, such an ordering is straightforward. If $S = \{0, 1\}^n$, define $x \le y$ if $y_i = 1$
whenever $x_i = 1$ for all $i = 1, \ldots, n$. If the deterministic transition rule q maintains
this partial ordering of states, then only the sample paths that start from all-black and
all-white images need to be monitored for coalescence.


Example 8.8 (Sandwiching Binary Images) Figure 8.12 shows five iterations
of a Gibbs sampler CFTP algorithm for a 4 × 4 binary image with order-preserving
pixelwise updates. The sample path in the top row starts at iteration t = -1000, where
the image is all black. In other words, $x_i^{(-1000)} = 1$ for $i = 1, \ldots, 16$. The sample path
in the bottom row starts at all white. The path starting from all black is the upper bound
and the path starting from all white is the lower bound used for sandwiching.

FIGURE 8.12 Sequence of images from a perfect sampling algorithm for a binary image
problem. See Example 8.8 for details.

After some initial iterations, we examine the paths around t = -400. In the
lower sample path, the circled pixel at iteration t = -400 changed from white to
black at t = -399. Monotonicity requires that this pixel change to black in the upper
path too. This requirement is implemented directly via the monotone update function
q. Note, however, that changes from white to black in the upper image do not compel
the same change in the lower image; see, for example, the pixel to the right of the
circled pixel.

Changes from black to white in the upper image compel the same change in the
lower image. For example, the circled pixel in the upper sample path at t = -399 has
changed from black to white at t = -398, thereby forcing the corresponding pixel in
the lower sample path to change to white. A pixel change from black to white in the
lower image does not necessitate a like change to the upper image.

Examination of the pixels in these sequences of images shows that pixelwise
image ordering is maintained over the simulation. At iteration t = 0, the two sample
paths have coalesced. Therefore a chain started at any image at t = -1000 must also
have coalesced to the same image by iteration t = 0. The image shown at t = 0 is a
realization from the stationary distribution of the chain.


Example 8.9 (Utah Serviceberry Distribution, Continued) The setup for the
CFTP algorithm for the species distribution mapping problem closely follows the
development of the Gibbs sampler described in Example 8.6. To update the ith pixel
at iteration t + 1, generate $U^{(t+1)}$ from Unif(0, 1). Then the update is given by

\[ X_i^{(t+1)} = q\big(x_{-i}^{(t)}, U^{(t+1)}\big) = \begin{cases} 1 & \text{if } U^{(t+1)} < P\big[X_i^{(t+1)} = 1 \mid x_{-i}^{(t)}, y\big], \\ 0 & \text{otherwise,} \end{cases} \tag{8.49} \]

where $P\big[X_i^{(t+1)} = 1 \mid x_{-i}^{(t)}, y\big]$ is given in (8.44). Such updates maintain the partial
ordering of the state space. Therefore, the CFTP algorithm can be implemented
by starting at two initial images: all black and all white. These images are
monitored, and the CFTP algorithm proceeds until they coalesce by iteration t = 0.
The CFTP algorithm has been implemented for similar binary image problems in
[165, 166].
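
A small-scale version of this sandwiching scheme can be sketched in Python. The code below is a hedged illustration rather than the implementation from the cited work: the function names `p_black`, `sweep`, and `cftp_image`, the raster-scan sweep, the doubling of the starting time, and the boundary handling are assumptions made here. It also assumes $\beta \ge 0$, which is what makes the update (8.49) order preserving. One array of uniforms is stored per sweep and reused whenever the starting time is moved further back.

```python
import numpy as np

def p_black(x, y, r, c, alpha, beta):
    """Conditional probability (8.44) that pixel (r, c) equals 1."""
    nrow, ncol = x.shape
    diff = 0
    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= rr < nrow and 0 <= cc < ncol:
            diff += 1 - 2 * int(x[rr, cc])
    return 1.0 / (1.0 + np.exp(alpha * (1 - 2 * int(y[r, c])) + beta * diff))

def sweep(x, y, u, alpha, beta):
    """Apply the deterministic update (8.49) to every pixel in raster order, in place."""
    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            x[r, c] = 1 if u[r, c] < p_black(x, y, r, c, alpha, beta) else 0
    return x

def cftp_image(y, alpha, beta, rng):
    """Monotone CFTP tracking only the all-white and all-black sample paths."""
    us = []    # us[k] holds the uniforms for the sweep from time -(k+1) to time -k
    tau = 1
    while True:
        while len(us) < tau:               # extend into the past, reuse old uniforms
            us.append(rng.random(y.shape))
        lo, hi = np.zeros_like(y), np.ones_like(y)
        for k in range(tau - 1, -1, -1):
            lo = sweep(lo, y, us[k], alpha, beta)
            hi = sweep(hi, y, us[k], alpha, beta)
        if np.array_equal(lo, hi):         # sandwiched paths agree at time 0
            return lo, -tau
        tau *= 2
```

For a 4 × 4 image with $\alpha = 1$ and $\beta = 0.8$ this typically returns quickly; for large images or strong interaction the coalescence time can grow substantially.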


PROBLEMS 


8.1. One approach to Bayesian variable selection for linear regression models is described
in Section 8.2.1 and further examined in Example 8.3. For a Bayesian analysis for the
model in Equation (8.20), we might adopt the normal-gamma conjugate class of priors
$\beta_{m_k} \mid m_k \sim N(\alpha_{m_k}, \sigma^2 V_{m_k})$ and $\nu\lambda/\sigma^2 \sim \chi^2_\nu$. Show that the marginal density of $Y \mid m_k$ is
given by

\[ \frac{\Gamma((\nu + n)/2)\,(\nu\lambda)^{\nu/2}}{\pi^{n/2}\,\Gamma(\nu/2)\,\big|I + X_{m_k} V_{m_k} X_{m_k}^T\big|^{1/2}} \times \Big[ \lambda\nu + (Y - X_{m_k}\alpha_{m_k})^T \big(I + X_{m_k} V_{m_k} X_{m_k}^T\big)^{-1} (Y - X_{m_k}\alpha_{m_k}) \Big]^{-(\nu + n)/2}, \]

where $X_{m_k}$ is the design matrix, $\alpha_{m_k}$ is the mean vector, and $V_{m_k}$ is the covariance
matrix for $\beta_{m_k}$ for the model $m_k$.


8.2. Consider the CFTP algorithm described in Section 8.5. 


a. Construct an example with a finite state space to which both the Metropolis-Hastings
algorithm and the CFTP algorithm can be applied to simulate from some multivariate
stationary distribution f. For your example, define both the Metropolis-Hastings
ratio (7.1) and the deterministic transition rule (8.32), and show how these quantities
are related.


b. Construct an example with a state space with two elements so that the CFTP
algorithm can be applied to simulate from some stationary distribution f. Define two
deterministic transition rules of the form in (8.32). One transition rule, $q_1$, should
allow for coalescence in one iteration, and the other transition rule, $q_2$, should be
defined so that coalescence is impossible. Which assumption of the CFTP algorithm
is violated for $q_2$?


c. Construct an example with a state space with two elements that shows why the CFTP 
algorithm cannot be started at t = 0 and run forward to coalescence. This should 
illustrate the argument mentioned in the discussion after Example 8.5 (page 266). 


8.3. Suppose we desire a draw from the marginal distribution of X that is determined by the
assumptions that $\theta \sim \mathrm{Beta}(\alpha, \beta)$ and $X \mid \theta \sim \mathrm{Bin}(n, \theta)$ [96].

a. Show that $\theta \mid x \sim \mathrm{Beta}(\alpha + x, \beta + n - x)$.

b. What is the marginal expected value of X?

c. Implement a Gibbs sampler to obtain a joint sample of $(\theta, X)$, using $x^{(0)} = 0$, $\alpha = 10$,
$\beta = 5$, and n = 10.

d. Let $U^{(t+1)}$ and $V^{(t+1)}$ be independent Unif(0, 1) random variables. Then the transition
rule from $X^{(t)} = x^{(t)}$ to $X^{(t+1)}$ can be written as

\[ X^{(t+1)} = q\big(x^{(t)}, U^{(t+1)}, V^{(t+1)}\big) = F^{-1}_{\mathrm{Bin}}\Big( V^{(t+1)};\, n,\, F^{-1}_{\mathrm{Beta}}\big( U^{(t+1)};\, \alpha + x^{(t)},\, \beta + n - x^{(t)} \big) \Big), \tag{8.50} \]

where $F_d^{-1}(p; \mu_1, \mu_2)$ is the inverse cumulative distribution function of the
distribution d with parameters $\mu_1$ and $\mu_2$, evaluated at p. Implement the CFTP algorithm
from Section 8.5.1, using the transition rule given in (8.50), to draw a perfect
sample for this problem. Decrement $\tau$ by one unit each time the sample paths do not
coalesce by time 0. Run the function 100 times to produce 100 draws from the
stationary distribution for $\alpha = 10$, $\beta = 5$, and n = 10. Make a histogram of the 100
starting times (the finishing times are all t = 0, by construction). Make a histogram
of the 100 realizations of $X^{(0)}$. Discuss your results.

e. Run the function from part (d) several times for $\alpha = 1.001$, $\beta = 1$, and n = 10.
Pick a run where the chains were required to start at $\tau = -15$ or earlier. Graph the
sample paths (from each of the 11 starting values) from their starting time to t = 0,
connecting sequential states with lines. The goal is to observe the coalescence as in
the right panel in Figure 8.5. Comment on any interesting features of your graph.

f. Run the algorithm from part (d) several times. For each run, collect a perfect chain
of length 20 (i.e., once you have achieved coalescence, don't stop the algorithm
at t = 0, but continue the chain from t = 0 through t = 19). Pick one such chain
having $x^{(0)} = 0$, and graph its sample path for $t = 0, \ldots, 19$. Next, run the Gibbs
sampler from part (c) through t = 19 starting with $x^{(0)} = 0$. Superimpose the sample
path of this chain on your existing graph, using a dashed line.

i. Is t = 2 sufficient burn-in for the Gibbs sampler? Why or why not?

ii. Of the two chains (CFTP conditional on $x^{(0)} = 0$ and Gibbs starting from
$x^{(0)} = 0$), which should produce subsequent variates $X^{(t)}$ for $t = 1, 2, \ldots$ whose
distribution more closely resembles the target? Why does this conditional CFTP
chain fail to produce a perfect sample?


8.4. Consider the one-dimensional black-and-white image represented by a vector of zeros
and ones. The data (observed image) are

10101111010000101000010110101001101

for the 35 pixels $y = (y_1, \ldots, y_{35})$. Suppose the posterior density for the true image
$x$ is given by

$$
f(x \mid y) \propto \exp\left\{\sum_{i=1}^{35} \alpha(x_i, y_i)\right\} \exp\left\{\sum_{i \sim j} \beta\, 1_{\{x_i = x_j\}}\right\},
$$

where

$$
\alpha(x_i, y_i) = \begin{cases} \log\{2/3\} & \text{if } x_i = y_i, \\ \log\{1/3\} & \text{if } x_i \neq y_i. \end{cases}
$$

Consider the Swendsen–Wang algorithm for this problem where the bond variable is
drawn according to $U_{ij} \mid x \sim \text{Unif}\left(0, \exp\left\{\beta\, 1_{\{x_i = x_j\}}\right\}\right)$.

FIGURE 8.13 Forty Gibbs sampler iterates for Problem 8.4, with $\beta = 1$. (Axes: Time and Pixel.)


a. Implement the Swendsen–Wang algorithm described above with $\beta = 1$. Create a
chain of length 40, starting from the initial image $x^{(0)}$ equal to the observed data.
Note that the entire sequence of images can be displayed in a two-dimensional
graph as shown in Figure 8.13. This figure was created using a Gibbs sampler. Using
your output from your implementation of the Swendsen–Wang algorithm, create a
graph analogous to Figure 8.13 for your Swendsen–Wang iterations. Comment on
the differences between your graph and Figure 8.13.

b. Investigate the effect of $\beta$ by repeating part (a) for $\beta = 0.5$ and $\beta = 2$. Comment
on the differences between your graphs and the results in part (a).

c. Investigate the effect of the starting value by repeating part (a) for three different
starting values: first with $x^{(0)} = (0, \ldots, 0)$, second with $x^{(0)} = (1, \ldots, 1)$, and third
with $x_i^{(0)} = 0$ for $i = 1, \ldots, 17$ and $x_i^{(0)} = 1$ for $i = 18, \ldots, 35$. Compare the results
of these trials with the results from part (a).


d. What would be a good way to produce a single best image to represent your estimate 
of the truth? 


8.5. Data corresponding to the true image and observed images given in Figure 8.14 are
available on the website for this book. The true image is a binary 20 × 20 pixel image
with prior density given by

$$
f(x_i \mid x_{\delta_i}) = N\left(\bar{x}_{\delta_i},\, \sigma^2/\nu_i\right)
$$

for $i = 1, \ldots, n$, where $\nu_i$ is the number of neighbors in the neighborhood $\delta_i$ of $x_i$ and
$\bar{x}_{\delta_i}$ is the mean value of the neighbors of the $i$th pixel. This density promotes local
dependence. The observed image is a gray scale degraded version of the true image
with noise that can be modeled via a normal distribution. Suppose the likelihood is
given by

$$
f(y_i \mid x_i) = N\left(x_i, \sigma^2\right)
$$

for $i = 1, \ldots, n$.

FIGURE 8.14 Images for Problem 8.5. The left panel is the true image, and the right panel
is an observed image.


a. Prove that the univariate conditional posterior distribution used for Gibbs sampling for
this problem is given by

$$
f(x_i \mid x_{-i}, y) = N\left(\frac{1}{\nu_i + 1}\, y_i + \frac{\nu_i}{\nu_i + 1}\, \bar{x}_{\delta_i},\; \frac{\sigma^2}{\nu_i + 1}\right).
$$

b. Starting with the initial image $x^{(0)}$ equal to the observed data image, and using $\sigma = 5$
and a second-order neighborhood, use the Gibbs sampler (with no burn-in period
or subsampling) to generate a collection of 100 images from the posterior. Do not
count an image as new until an update has been proposed for each of its pixels
(i.e., a full cycle). Record the data necessary to make the following plots: the data
image, the first image sampled from the posterior distribution ($X^{(1)}$), the last image
sampled from the posterior distribution ($X^{(100)}$), and the mean image.

Hints:

• Dealing with the edges is tricky because the neighborhood size varies. You may
find it convenient to create a matrix with 22 rows and 22 columns consisting of
the observed data surrounded on each of the four sides by a row or column of
zeros. If you use this strategy, be sure that this margin area does not affect the
analysis.

• Plot $X^{(t)}$ at the end of each full cycle so that you can better understand the behavior
of your chain.


c. Fill in the rest of a 2 × 3 factorial design with runs analogous to (b), crossing the
following factors and levels:

• Neighborhood structure chosen to be (i) first-order neighborhoods or (ii) second-order
neighborhoods.

• Pixelwise error chosen to have variability given by (i) $\sigma = 2$, (ii) $\sigma = 5$, or
(iii) $\sigma = 15$.

Provide plots and detailed comments comparing the results from each design point
in this experiment.

d. Repeat a run analogous to part (b) once more, but this time using the initial starting
image $x^{(0)}$ equal to 57.5 (the true posterior mean pixel color) everywhere, for $\sigma = 5$
and a first-order neighborhood. Discuss your results and their implications for the
behavior of the chain.


PART III
BOOTSTRAPPING


In the previous four chapters, we explored how to estimate expectations of
random variables. A mean is never enough, however. Ideally we would like to
know the entire probability distribution of the variable.

Bootstrapping is a computationally intensive method that allows researchers
to simulate the distribution of a statistic. The idea is to repeatedly
resample the observed data, each time producing an empirical distribution 
function from the resampled data. For each resampled data set—or equiva- 
lently each empirical distribution function—a new value of the statistic can 
be computed, and the collection of these values provides an estimate of the 
sampling distribution of the statistic of interest. In this manner, the method 
allows you to “pull yourself up by your bootstraps” (an old idiom, popularized 
in America, that means to improve your situation without outside help). Boot- 
strapping is nonparametric by nature, and there is a certain appeal to letting 
the data speak so freely. 

Bootstrapping was first developed for independent and identically distributed
data, but this assumption can be relaxed so that bootstrap estimation
from dependent data such as regression residuals or time series data is possible.
We will explore bootstrapping methods in both the independent and
dependent cases, along with approaches for improving performance using
more complex variations.

CHAPTER 9


BOOTSTRAPPING


9.1 THE BOOTSTRAP PRINCIPLE 


Let $\theta = T(F)$ be an interesting feature of a distribution function, $F$, expressed
as a functional of $F$. For example, $T(F) = \int z \, dF(z)$ is the mean of the distribution.
Let $x_1, \ldots, x_n$ be data observed as a realization of the random variables
$X_1, \ldots, X_n \sim$ i.i.d. $F$. In this chapter, we use $X \sim F$ to denote that $X$ is distributed
with density function $f$ having corresponding cumulative distribution function $F$. Let
$\mathcal{X} = \{X_1, \ldots, X_n\}$ denote the entire dataset.

If $\hat{F}$ is the empirical distribution function of the observed data, then an estimate
of $\theta$ is $\hat{\theta} = T(\hat{F})$. For example, when $\theta$ is a univariate population mean, the estimator
is the sample mean, $\hat{\theta} = \int z \, d\hat{F}(z) = \sum_{i=1}^{n} X_i / n$.

Statistical inference questions are usually posed in terms of $T(\hat{F})$ or some
$R(\mathcal{X}, F)$, a statistical function of the data and their unknown distribution function
$F$. For example, a general test statistic might be $R(\mathcal{X}, F) = [T(\hat{F}) - T(F)] / S(\hat{F})$,
where $S$ is a functional that estimates the standard deviation of $T(\hat{F})$.

The distribution of the random variable $R(\mathcal{X}, F)$ may be intractable or altogether
unknown. This distribution also may depend on the unknown distribution $F$. The
bootstrap provides an approximation to the distribution of $R(\mathcal{X}, F)$ derived from the
empirical distribution function of the observed data (itself an estimate of $F$) [175,
177]. Several thorough reviews of bootstrap methods have been published since its
introduction [142, 181, 183].

Let $\mathcal{X}^*$ denote a bootstrap sample of pseudo-data, which we will call a pseudo-dataset.
The elements of $\mathcal{X}^* = \{X_1^*, \ldots, X_n^*\}$ are i.i.d. random variables with
distribution $\hat{F}$. The bootstrap strategy is to examine the distribution of $R(\mathcal{X}^*, \hat{F})$, that
is, the random variable formed by applying $R$ to $\mathcal{X}^*$. In some special cases it is
possible to derive or estimate the distribution of $R(\mathcal{X}^*, \hat{F})$ through analytical means
(see Example 9.1 and Problems 9.1 and 9.2). However, the usual approach is via
simulation, as described in Section 9.2.1.


Example 9.1 (Simple Illustration) Suppose $n = 3$ univariate data points, namely
$\{x_1, x_2, x_3\} = \{1, 2, 6\}$, are observed as an i.i.d. sample from a distribution $F$ that has
mean $\theta$. At each observed data value, $\hat{F}$ places mass $1/3$. Suppose the estimator to be
bootstrapped is the sample mean $\hat{\theta}$, which we may write as $T(\hat{F})$ or $R(\mathcal{X}, F)$, where
$R$ does not depend on $F$ in this case.

TABLE 9.1 Possible bootstrap pseudo-datasets from {1, 2, 6} (ignoring
order), the resulting values of $\hat{\theta}^* = T(\hat{F}^*)$, the probability of each outcome
in the bootstrapping experiment ($P^*[\hat{\theta}^*]$), and the observed relative frequency
in 1000 bootstrap iterations.

$\mathcal{X}^*$     $\hat{\theta}^*$    $P^*[\hat{\theta}^*]$    Observed Frequency
1 1 1     3/3     1/27     36/1000
1 1 2     4/3     3/27     101/1000
1 2 2     5/3     3/27     123/1000
2 2 2     6/3     1/27     25/1000
1 1 6     8/3     3/27     104/1000
1 2 6     9/3     6/27     227/1000
2 2 6     10/3    3/27     131/1000
1 6 6     13/3    3/27     111/1000
2 6 6     14/3    3/27     102/1000
6 6 6     18/3    1/27     40/1000


Let $\mathcal{X}^* = \{X_1^*, X_2^*, X_3^*\}$ consist of elements drawn i.i.d. from $\hat{F}$. There are
$3^3 = 27$ possible outcomes for $\mathcal{X}^*$. Let $\hat{F}^*$ denote the empirical distribution function
of such a sample, with corresponding estimate $\hat{\theta}^* = T(\hat{F}^*)$. Since $\hat{\theta}^*$ does not depend
on the ordering of the data, it has only 10 distinct possible outcomes. Table 9.1 lists
these.

In Table 9.1, $P^*[\hat{\theta}^*]$ represents the probability distribution for $\hat{\theta}^*$ with respect
to the bootstrap experiment of drawing $\mathcal{X}^*$ conditional on the original observations.
To distinguish this distribution from $F$, we will use an asterisk when referring to such
conditional probabilities or moments, as when writing $P^*[\hat{\theta}^* \le 6/3] = 8/27$.


The bootstrap principle is to equate the distributions of $R(\mathcal{X}, F)$ and $R(\mathcal{X}^*, \hat{F})$.
In this example, that means we base inference on the distribution of $\hat{\theta}^*$. This distribution
is summarized in the columns of Table 9.1 labeled $\hat{\theta}^*$ and $P^*[\hat{\theta}^*]$. So, for
example, a simple bootstrap $25/27$ (roughly 93%) confidence interval for $\theta$ is $(4/3, 14/3)$,
using quantiles of the distribution of $\hat{\theta}^*$. The point estimate is still calculated from
the observed data as $\hat{\theta} = 9/3$.
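Because this example is so small, the bootstrap distribution in Table 9.1 can be checked by brute force. The following Python sketch is an illustration added here, not code from the original analysis; the variable names are arbitrary. It enumerates all 27 ordered pseudo-datasets, tabulates them by the value of the bootstrap mean, and computes two of the probabilities discussed above with exact rational arithmetic.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

data = [1, 2, 6]
n = len(data)
total = n ** n  # 27 equally likely ordered pseudo-datasets

# Tabulate pseudo-datasets by their sum; the bootstrap mean is sum / n.
counts = Counter(sum(sample) for sample in product(data, repeat=n))

for s in sorted(counts):
    print(f"theta* = {s}/{n}   P* = {counts[s]}/{total}")

# P*[theta* <= 6/3]: total mass on outcomes whose sum is at most 6.
cdf_at_2 = Fraction(sum(c for s, c in counts.items() if s <= 6), total)

# Mass on outcomes from 4/3 through 14/3, i.e. dropping only 3/3 and 18/3.
coverage = Fraction(sum(c for s, c in counts.items() if 4 <= s <= 14), total)
print(cdf_at_2, coverage)
```

The ten printed rows reproduce the probability column of Table 9.1, and the interval that excludes only the two most extreme outcomes carries mass 25/27, or about 93%.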


9.2 BASIC METHODS 


9.2.1 Nonparametric Bootstrap 


For realistic sample sizes the number of potential bootstrap pseudo-datasets is very
large, so complete enumeration of the possibilities is not practical. Instead, $B$ independent
random bootstrap pseudo-datasets are drawn from the empirical distribution
function of the observed data, namely $\hat{F}$. Denote these $\mathcal{X}_i^* = \{X_{i1}^*, \ldots, X_{in}^*\}$ for
$i = 1, \ldots, B$. The empirical distribution of the $R(\mathcal{X}_i^*, \hat{F})$ for $i = 1, \ldots, B$ is used
to approximate the distribution of $R(\mathcal{X}, F)$, allowing inference. The simulation error
introduced by avoiding complete enumeration of all possible pseudo-datasets can be
made arbitrarily small by increasing $B$. Using the bootstrap frees the analyst from
making parametric assumptions to carry out inference, provides answers to problems
for which analytic solutions are impossible, and can yield more accurate answers than
given by routine application of standard parametric theory.


Example 9.2 (Simple Illustration, Continued) Continuing with the dataset in
Example 9.1, recall that the empirical distribution function of the observed data, $\hat{F}$,
places mass $1/3$ on 1, 2, and 6. A nonparametric bootstrap would generate $\mathcal{X}_i^*$ by
sampling $X_{i1}^*$, $X_{i2}^*$, and $X_{i3}^*$ i.i.d. from $\hat{F}$. In other words, draw the $X_{ij}^*$ with replacement
from {1, 2, 6} with equal probability. Each bootstrap pseudo-dataset yields a
corresponding estimate $\hat{\theta}^*$. Table 9.1 shows the observed relative frequencies of the
possible values for $\hat{\theta}^*$ resulting from $B = 1000$ randomly drawn pseudo-datasets, $\mathcal{X}_i^*$.
These relative frequencies approximate $P^*[\hat{\theta}^*]$. The bootstrap principle asserts that
$P^*[\hat{\theta}^*]$ in turn approximates the sampling distribution of $\hat{\theta}$.

For this simple illustration, the space of all possible bootstrap pseudo-datasets
can be completely enumerated and the $P^*[\hat{\theta}^*]$ exactly derived. Therefore there is
no need to resort to simulation. In realistic applications, however, the sample size is
too large to enumerate the bootstrap sample space. Thus, in real applications (e.g.,
Section 9.2.3), only a small proportion of possible pseudo-datasets will ever be drawn,
often yielding only a subset of possible values for the estimator.
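The simulation step can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind Table 9.1: the function name `bootstrap_means`, the seed, and the use of the standard `random` module are assumptions made here, so the simulated frequencies will differ somewhat from those in the table.

```python
import random
from collections import Counter

def bootstrap_means(data, B, rng):
    """Draw B pseudo-datasets with replacement and return their sample means."""
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(B)]

rng = random.Random(1)  # arbitrary seed, fixed only for reproducibility
B = 1000
theta_stars = bootstrap_means([1, 2, 6], B, rng)

# Tabulate by 3 * theta*, which is an integer for this dataset.
freq = Counter(round(3 * t) for t in theta_stars)
for k in sorted(freq):
    print(f"theta* = {k}/3   observed frequency = {freq[k]}/{B}")
```

Increasing `B` brings the observed relative frequencies closer to the exact bootstrap probabilities, which is the sense in which the simulation error can be made arbitrarily small.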


A fundamental requirement of bootstrapping is that the data to be resampled
must have originated as an i.i.d. sample. If the sample is not i.i.d., the distributional
approximation of $R(\mathcal{X}, F)$ by $R(\mathcal{X}^*, \hat{F})$ will not hold. Section 9.2.3 illustrates that the
user must carefully consider the relationship between the stochastic mechanism generating
the observed data and the bootstrap resampling strategy employed. Methods
for bootstrapping with dependent data are described in Section 9.5.


9.2.2 Parametric Bootstrap 


The ordinary nonparametric bootstrap described above generates each pseudo-dataset
$\mathcal{X}^*$ by drawing $X_1^*, \ldots, X_n^*$ i.i.d. from $\hat{F}$. When the data are modeled to originate from
a parametric distribution, so $X_1, \ldots, X_n \sim$ i.i.d. $F(x, \theta)$, another estimate of $F$ may
be employed. Suppose that the observed data are used to estimate $\theta$ by $\hat{\theta}$. Then each
parametric bootstrap pseudo-dataset $\mathcal{X}^*$ can be generated by drawing $X_1^*, \ldots, X_n^* \sim$
i.i.d. $F(x, \hat{\theta})$. When the model is known or believed to be a good representation
of reality, the parametric bootstrap can be a powerful tool, allowing inference in
otherwise intractable situations and producing confidence intervals that are much
more accurate than those produced by standard asymptotic theory.
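To make the recipe concrete, here is a small Python sketch of a parametric bootstrap. Everything specific in it is an assumption chosen for illustration and does not come from the text: the exponential model, the made-up data values, the seed, and the function name `parametric_bootstrap_rate`. The only point is the pattern of estimating the parameter, drawing pseudo-datasets from the fitted model, and re-estimating on each one.

```python
import random
import statistics

def parametric_bootstrap_rate(x, B, rng):
    """Parametric bootstrap for the rate of an assumed exponential model."""
    n = len(x)
    rate_hat = 1.0 / statistics.fmean(x)  # maximum likelihood estimate of the rate
    rate_stars = []
    for _ in range(B):
        # Draw a pseudo-dataset from the fitted model F(x, theta-hat).
        pseudo = [rng.expovariate(rate_hat) for _ in range(n)]
        rate_stars.append(1.0 / statistics.fmean(pseudo))
    return rate_hat, rate_stars

# Hypothetical waiting times, invented for this example.
x = [0.8, 2.1, 0.3, 1.7, 0.5, 3.9, 1.2, 0.9]
rng = random.Random(5)  # arbitrary seed
rate_hat, rate_stars = parametric_bootstrap_rate(x, 2000, rng)
print(rate_hat, statistics.fmean(rate_stars))
```

The pseudo-data here come from the fitted exponential distribution rather than from the empirical distribution function, so the results are only as trustworthy as the assumed model.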

In some cases, however, the model upon which bootstrapping is based is almost 
an afterthought. For example, a deterministic biological population model might pre- 
dict changes in population abundance over time, based on biological parameters and 
initial population size. Suppose animals are counted at various times using various 
methodologies. The observed counts are compared with the model predictions to find
model parameter values that yield a good fit. One might fashion a second model
asserting that the observations are, say, lognormally distributed with mean equal to 
the prediction from the biological model and with a predetermined coefficient of 
variation. This provides a convenient—if weakly justified—link between the param- 
eters and the observations. A parametric bootstrap from the second model can then 
be applied by drawing bootstrap pseudo-datasets from this lognormal distribution. 
In this case, the sampling distribution of the observed data can hardly be viewed as 
arising from the lognormal model. 

Such an analysis, relying on an ad hoc error model, should be a last resort. It 
is tempting to use a convenient but inappropriate model. If the model is not a good 
fit to the mechanism generating the data, the parametric bootstrap can lead inference 
badly astray. There are occasions, however, when few other inferential tools seem 
feasible. 


9.2.3 Bootstrapping Regression 


Consider the ordinary multiple regression model, $Y_i = x_i^T \beta + \epsilon_i$, for $i = 1, \ldots, n$,
where the $\epsilon_i$ are assumed to be i.i.d. mean zero random variables with constant variance.
Here, $x_i$ and $\beta$ are $p$-vectors of predictors and parameters, respectively. A naive
bootstrapping mistake would be to resample from the collection of response values a
new pseudo-response, say $Y_i^*$, for each observed $x_i$, thereby generating a new regression
dataset. Then a bootstrap parameter vector estimate, $\hat{\beta}^*$, would be calculated from
these pseudo-data. After repeating the sampling and estimation steps many times, the
empirical distribution of $\hat{\beta}^*$ would be used for inference about $\beta$. The mistake is that
the $Y_i \mid x_i$ are not i.i.d.; they have different conditional means. Therefore, it is not
appropriate to generate bootstrap regression datasets in the manner described.

We must ask what variables are i.i.d. in order to determine a correct bootstrapping
approach. The $\epsilon_i$ are i.i.d. given the model. Thus a more appropriate strategy
would be to bootstrap the residuals as follows.

Start by fitting the regression model to the observed data and obtaining the fitted
responses $\hat{y}_i$ and residuals $\hat{\epsilon}_i$. Sample a bootstrap set of residuals, $\{\hat{\epsilon}_1^*, \ldots, \hat{\epsilon}_n^*\}$, from
the set of fitted residuals, completely at random with replacement. (Note that the $\hat{\epsilon}_i$
are actually not independent, though they are usually roughly so.) Create a bootstrap
set of pseudo-responses, $Y_i^* = \hat{y}_i + \hat{\epsilon}_i^*$, for $i = 1, \ldots, n$. Regress $Y^*$ on $x$ to obtain a
bootstrap parameter estimate $\hat{\beta}^*$. Repeat this process many times to build an empirical
distribution for $\hat{\beta}^*$ that can be used for inference.
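The residual-resampling steps above can be sketched as follows for a simple linear regression. The helper names `fit_line` and `residual_bootstrap` and the seed are hypothetical, and the data are the copper–nickel alloy values of Table 9.2, borrowed here purely as a convenient numerical illustration (the text itself analyzes those data with the paired bootstrap).

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def residual_bootstrap(x, y, B, rng):
    """Resample fitted residuals, rebuild pseudo-responses, and refit B times."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    estimates = []
    for _ in range(B):
        e_star = rng.choices(residuals, k=len(residuals))  # with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        estimates.append(fit_line(x, y_star))  # the x values stay fixed
    return (b0, b1), estimates

# Data from Table 9.2: iron content x and corrosion loss y.
x = [0.01, 0.48, 0.71, 0.95, 1.19, 0.01, 0.48, 1.44, 0.71, 1.96, 0.01, 1.44, 1.96]
y = [127.6, 124.0, 110.8, 103.9, 101.5, 130.1, 122.0, 92.3, 113.1, 83.7, 128.0, 91.4, 86.2]

(b0, b1), boot = residual_bootstrap(x, y, 2000, random.Random(7))
print(b0, b1)
```

Note that the predictors are never resampled in this scheme; only the residuals are, which is why the approach suits designs where the predictor values are fixed in advance.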

This approach is most appropriate for designed experiments or other data where 
the x; values are fixed in advance. The strategy of bootstrapping residuals is at the 
core of simple bootstrapping methods for other models such as autoregressive models, 
nonparametric regression, and generalized linear models. 

Bootstrapping the residuals is reliant on the chosen model providing an appro- 
priate fit to the observed data, and on the assumption that the residuals have constant 
variance. Without confidence that these conditions hold, a different bootstrapping 
method is probably more appropriate.

TABLE 9.2 Copper–nickel alloy data for illustrating methods of obtaining a bootstrap confidence
interval for $\beta_1/\beta_0$.

$x_i$    0.01    0.48    0.71    0.95    1.19    0.01    0.48
$y_i$    127.6   124.0   110.8   103.9   101.5   130.1   122.0

$x_i$    1.44    0.71    1.96    0.01    1.44    1.96
$y_i$    92.3    113.1   83.7    128.0   91.4    86.2

Suppose that the data arose from an observational study, where both response
and predictors are measured from a collection of individuals selected at random.
In this case, the data pairs $z_i = (x_i, y_i)$ can be viewed as values observed for i.i.d.
random variables $Z_i = (X_i, Y_i)$ drawn from a joint response–predictor distribution. To
bootstrap, sample $Z_1^*, \ldots, Z_n^*$ completely at random with replacement from the set of
observed data pairs, $\{z_1, \ldots, z_n\}$. Apply the regression model to the resulting pseudo-dataset
to obtain a bootstrap parameter estimate $\hat{\beta}^*$. Repeat these steps many times,
then proceed to inference as in the first approach. This approach of bootstrapping the
cases is sometimes called the paired bootstrap.

If you have doubts about the adequacy of the regression model, the constancy 
of the residual variance, or other regression assumptions, the paired bootstrap will be 
less sensitive to violations in the assumptions than will bootstrapping the residuals. 
The paired bootstrap sampling more directly mirrors the original data generation 
mechanism in cases where the predictors are not considered fixed. 

There are other, more complex methods for bootstrapping regression problems 
[142, 179, 183, 330]. 


Example 9.3 (Copper–Nickel Alloy) Table 9.2 gives 13 measurements of corrosion
loss ($y_i$) in copper–nickel alloys, each with a specific iron content ($x_i$) [170].
Of interest is the change in corrosion loss in the alloys as the iron content increases,
relative to the corrosion loss when there is no iron. Thus, consider the estimation of
$\theta = \beta_1/\beta_0$ in a simple linear regression.

Letting $z_i = (x_i, y_i)$ for $i = 1, \ldots, 13$, suppose we adopt the paired bootstrapping
approach. The observed data yield the estimate $\hat{\theta} = \hat{\beta}_1/\hat{\beta}_0 = -0.185$. For
$i = 2, \ldots, 10{,}000$, we draw a bootstrap dataset $\{Z_1^*, \ldots, Z_{13}^*\}$ by resampling 13 data
pairs from the set $\{z_1, \ldots, z_{13}\}$ completely at random with replacement. Figure 9.1
shows a histogram of the estimates obtained from regressions of the bootstrap datasets.
The histogram summarizes the sampling variability of $\hat{\theta}$ as an estimator of $\theta$.
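A paired bootstrap of this kind can be sketched in Python as follows. The data are those of Table 9.2; the function name `slope_over_intercept`, the seed, and the guard against degenerate resamples are assumptions added for this illustration, so the simulated values will not match the published histogram draw for draw.

```python
import random

# Copper-nickel alloy data from Table 9.2: iron content x, corrosion loss y.
x = [0.01, 0.48, 0.71, 0.95, 1.19, 0.01, 0.48, 1.44, 0.71, 1.96, 0.01, 1.44, 1.96]
y = [127.6, 124.0, 110.8, 103.9, 101.5, 130.1, 122.0, 92.3, 113.1, 83.7, 128.0, 91.4, 86.2]

def slope_over_intercept(pairs):
    """Return slope / intercept of the least-squares line, or None if x is constant."""
    if len({p[0] for p in pairs}) < 2:
        return None  # a line cannot be fitted when every x value is the same
    n = len(pairs)
    xbar = sum(p[0] for p in pairs) / n
    ybar = sum(p[1] for p in pairs) / n
    sxx = sum((p[0] - xbar) ** 2 for p in pairs)
    b1 = sum((p[0] - xbar) * (p[1] - ybar) for p in pairs) / sxx
    b0 = ybar - b1 * xbar
    return b1 / b0

pairs = list(zip(x, y))
theta_hat = slope_over_intercept(pairs)

rng = random.Random(9)  # arbitrary seed
B = 10_000
theta_stars = []
while len(theta_stars) < B:
    resample = rng.choices(pairs, k=len(pairs))  # draw whole (x, y) cases
    estimate = slope_over_intercept(resample)
    if estimate is not None:  # skip the very rare degenerate resample
        theta_stars.append(estimate)

print(round(theta_hat, 5))
```

Resampling whole cases keeps each response attached to its own predictor value, which is what distinguishes this scheme from the naive resampling of responses alone.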


9.2.4 Bootstrap Bias Correction 


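A particularly interesting choice for bootstrap analysis when $T(F) = \theta$ is the quantity
$R(\mathcal{X}, F) = T(\hat{F}) - T(F)$. This represents the bias of $T(\hat{F}) = \hat{\theta}$, and it has mean equal
to $E\{\hat{\theta}\} - \theta$. The bootstrap estimate of the bias is $\sum_{i=1}^{B} (\hat{\theta}_i^* - \hat{\theta})/B = \bar{\theta}^* - \hat{\theta}$.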


FIGURE 9.1 Histogram of 10,000 bootstrap estimates of $\beta_1/\beta_0$ from the nonparametric
paired bootstrap analysis with the copper–nickel alloy data.


Example 9.4 (Copper–Nickel Alloy, Continued) For the copper–nickel alloy
regression data introduced in Example 9.3, the mean value of $\hat{\theta}^* - \hat{\theta}$ among the
bootstrap pseudo-datasets is −0.00125, indicating a small degree of negative bias.
Thus, the bias-corrected bootstrap estimate of $\beta_1/\beta_0$ is −0.18507 − (−0.00125) =
−0.184. The bias estimate can naturally be incorporated into confidence interval
estimates via the nested bootstrap of Section 9.3.2.4.
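The arithmetic of bootstrap bias correction is easy to verify on a tiny case. The sketch below is an added illustration, not part of the alloy analysis: it takes the three-point dataset {1, 2, 6} of Example 9.1, uses the plug-in variance as the statistic (a choice made here only because its bias is well known), and computes the bootstrap bias estimate exactly by enumerating all 27 pseudo-datasets.

```python
from fractions import Fraction
from itertools import product

def plug_in_variance(sample):
    """Variance with divisor n, i.e. the functional T applied to an empirical distribution."""
    n = len(sample)
    mean = Fraction(sum(sample), n)
    return sum((Fraction(v) - mean) ** 2 for v in sample) / n

data = [1, 2, 6]
n = len(data)
theta_hat = plug_in_variance(data)  # T(F-hat)

# Exact bootstrap mean of theta*: average over all n**n equally likely pseudo-datasets.
theta_bar_star = sum(plug_in_variance(s) for s in product(data, repeat=n)) / n ** n

bias_estimate = theta_bar_star - theta_hat   # bootstrap estimate of bias
bias_corrected = theta_hat - bias_estimate   # subtract the estimated bias
print(theta_hat, theta_bar_star, bias_estimate, bias_corrected)
```

Here the bias estimate is negative, reflecting the familiar downward bias of the divisor-$n$ variance, and subtracting it moves the estimate upward, just as subtracting −0.00125 moves the alloy estimate from −0.18507 to −0.184.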


An improved bias estimate requires only a little additional effort. Let $\hat{F}_j^*$
denote the empirical distribution of the $j$th bootstrap pseudo-dataset, and define
$\bar{F}^*(x) = \sum_{j=1}^{B} \hat{F}_j^*(x)/B$. Then $\bar{\theta}^* - T(\bar{F}^*)$ is a better estimate of bias. Compare this
strategy with bootstrap bagging, discussed in Section 9.7. Study of the merits of these
and other bias corrections has shown that $\bar{\theta}^* - T(\bar{F}^*)$ has superior performance and
convergence rate [183].


9.3 BOOTSTRAP INFERENCE 


9.3.1 Percentile Method 


The simplest method for drawing inference about a univariate parameter $\theta$ using bootstrap
simulations is to construct a confidence interval using the percentile method.
This amounts to reading percentiles off the histogram of $\hat{\theta}^*$ values produced by bootstrapping.
It has been the approach implicit in the preceding discussion.


Example 9.5 (Copper–Nickel Alloy, Continued) Returning to the estimation of
$\theta = \beta_1/\beta_0$ for the copper–nickel alloy regression data introduced in Example 9.3,
recall that Figure 9.1 summarizes the sampling variability of $\hat{\theta}$ as an estimator of $\theta$.
A bootstrap $1 - \alpha$ confidence interval based on the percentile method could be constructed
by finding the $[(1 - \alpha/2)100]$th and $[(\alpha/2)100]$th empirical percentiles in
the histogram. The 95% confidence interval for $\beta_1/\beta_0$ using the simple bootstrap
percentile method is (−0.205, −0.174).
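Reading percentiles off the bootstrap histogram takes only a sort and two index lookups. The Python sketch below is a minimal illustration under stated assumptions: the quantile convention (the smallest order statistic whose empirical distribution function value is at least $p$), the function names, the seed, and the small made-up dataset are all choices made here, and other quantile conventions give slightly different endpoints.

```python
import math
import random

def empirical_quantile(values, p):
    """Smallest order statistic whose empirical CDF is at least p (0 < p <= 1)."""
    ordered = sorted(values)
    k = max(math.ceil(p * len(ordered)), 1)
    return ordered[k - 1]

def percentile_interval(theta_stars, alpha):
    """Bootstrap percentile interval with nominal level 1 - alpha."""
    return (empirical_quantile(theta_stars, alpha / 2),
            empirical_quantile(theta_stars, 1 - alpha / 2))

# Hypothetical data; the statistic being bootstrapped is the sample mean.
rng = random.Random(2)  # arbitrary seed
data = [4.1, 5.3, 2.8, 6.0, 3.7, 5.9, 4.4, 7.2, 3.1, 5.0]
theta_stars = [sum(rng.choices(data, k=len(data))) / len(data) for _ in range(5000)]

lower, upper = percentile_interval(theta_stars, 0.05)
print(lower, upper)
```

Because the endpoints sit in the tails of the bootstrap distribution, they are noisier than its center, which is one reason several thousand pseudo-datasets are commonly used.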


Conducting a hypothesis test is closely related to estimating a confidence inter- 
val. The simplest approach for bootstrap hypothesis testing is to base the p-value on 
a bootstrap confidence interval. Specifically, consider a null hypothesis expressed in 
terms of a parameter whose estimate can be bootstrapped. If the (1 — a)100% boot- 
strap confidence interval for the parameter does not cover the null value, then the 
null hypothesis is rejected with a p-value no greater than a. The confidence interval 
itself may be obtained from the percentile method or one of the superior approaches 
discussed later. 

Using a bootstrap confidence interval to conduct a hypothesis test often sac- 
rifices statistical power. Greater power is possible if the bootstrap simulations are 
carried out using a sampling distribution that is consistent with the null hypothe- 
sis [589]. Use of the null hypothesis sampling distribution of a test statistic is a 
fundamental tenet of hypothesis testing. Unfortunately, there will usually be many 
different bootstrap sampling strategies that are consistent with a given null hypothesis,
with each imposing various extra restrictions in addition to those imposed by
the null hypothesis. These different sampling models will yield hypothesis tests 
of different quality. More empirical and theoretical research is needed to develop 
bootstrap hypothesis testing methods, particularly methods for appropriate bootstrap 
sampling under the null hypothesis. Strategies for specific situations are illustrated 
by [142, 183]. 

Although simple, the percentile method is prone to bias and inaccurate coverage
probabilities. The bootstrap works better when $\theta$ is essentially a location parameter.
This is particularly important when using the percentile method. To ensure best bootstrap
performance, the bootstrapped statistic should be approximately pivotal: Its
distribution should not depend on the true value of $\theta$. Since a variance-stabilizing
transformation $g$ naturally renders the variance of $g(\hat{\theta})$ independent of $\theta$, it frequently
provides a good pivot. Section 9.3.2 discusses several approaches that rely on pivoting
to improve bootstrap performance.


9.3.1.1 Justification for the Percentile Method The percentile method can
be justified by a consideration of a continuous, strictly increasing transformation $\phi$ and
a distribution function $H$ that is continuous and symmetric [i.e., $H(z) = 1 - H(-z)$],
with the property that

$$
P\left[h_{\alpha/2} \le \phi(\hat{\theta}) - \phi(\theta) \le h_{1-\alpha/2}\right] = 1 - \alpha, \tag{9.1}
$$

where $h_\alpha$ is the $\alpha$ quantile of $H$. For instance, if $\phi$ is a normalizing, variance-stabilizing
transformation, then $H$ is the standard normal distribution. In principle, when $F$
is continuous we may transform any random variable $X \sim F$ to have any desired
distribution $G$, using the monotone transformation $G^{-1}(F(X))$. There is therefore
nothing special about normalization. In fact, the remarkable aspect of the percentile
approach is that we are never actually required to specify explicitly $\phi$ or $H$.
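Applying the bootstrap principle to (9.1), we have

$$
\begin{aligned}
1 - \alpha &\approx P^*\left[h_{\alpha/2} \le \phi(\hat{\theta}^*) - \phi(\hat{\theta}) \le h_{1-\alpha/2}\right] \\
&= P^*\left[h_{\alpha/2} + \phi(\hat{\theta}) \le \phi(\hat{\theta}^*) \le h_{1-\alpha/2} + \phi(\hat{\theta})\right] \\
&= P^*\left[\phi^{-1}\left(h_{\alpha/2} + \phi(\hat{\theta})\right) \le \hat{\theta}^* \le \phi^{-1}\left(h_{1-\alpha/2} + \phi(\hat{\theta})\right)\right].
\end{aligned} \tag{9.2}
$$

Since the bootstrap distribution is observed by us, its percentiles are known quantities
(aside from Monte Carlo variability which can be made arbitrarily small by increasing
the number of pseudo-datasets, $B$). Let $\xi_\alpha$ denote the $\alpha$ quantile of the empirical
distribution of $\hat{\theta}^*$. Then $\phi^{-1}\left(h_{\alpha/2} + \phi(\hat{\theta})\right) \approx \xi_{\alpha/2}$ and $\phi^{-1}\left(h_{1-\alpha/2} + \phi(\hat{\theta})\right) \approx \xi_{1-\alpha/2}$.

Next, the original probability statement (9.1) from which we hope to build a
confidence interval is reexpressed to isolate $\theta$. Exploiting symmetry by noting that
$h_{\alpha/2} = -h_{1-\alpha/2}$ yields

$$
P\left[\phi^{-1}\left(h_{\alpha/2} + \phi(\hat{\theta})\right) \le \theta \le \phi^{-1}\left(h_{1-\alpha/2} + \phi(\hat{\theta})\right)\right] = 1 - \alpha. \tag{9.3}
$$

The confidence limits in this equation happily coincide with the limits in (9.2), for
which we already have estimates $\xi_{\alpha/2}$ and $\xi_{1-\alpha/2}$. Hence we may simply read off the
quantiles for $\hat{\theta}^*$ from the bootstrap distribution and use these as the confidence limits
for $\theta$. Note that the percentile method is transformation respecting in the sense that
the percentile method confidence interval for a monotone transformation of $\theta$ is the
same as the transformation of the interval for $\theta$ itself [183].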
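The transformation-respecting property is easy to observe numerically when the interval endpoints are taken to be order statistics of the bootstrap values, since an increasing transformation does not change their ordering. The following Python sketch is an added illustration; the made-up positive data, the seed, the choice of the logarithm as the monotone transformation, and the function name are all assumptions.

```python
import math
import random

def percentile_interval(values, alpha):
    """Percentile interval whose endpoints are order statistics of the bootstrap values."""
    ordered = sorted(values)
    B = len(ordered)
    lo = ordered[max(math.ceil(alpha / 2 * B), 1) - 1]
    hi = ordered[max(math.ceil((1 - alpha / 2) * B), 1) - 1]
    return lo, hi

rng = random.Random(4)  # arbitrary seed
data = [0.6, 1.9, 0.8, 3.4, 1.1, 2.7, 0.9, 1.5]  # hypothetical positive data
theta_stars = [sum(rng.choices(data, k=len(data))) / len(data) for _ in range(2000)]

# Interval for theta, and interval computed directly on the log scale.
interval_theta = percentile_interval(theta_stars, 0.10)
interval_log = percentile_interval([math.log(t) for t in theta_stars], 0.10)

print(interval_log)
print(tuple(math.log(v) for v in interval_theta))
```

The two printed intervals agree: bootstrapping $\log \hat{\theta}$ and taking percentiles gives the same answer as taking the logarithm of the percentile interval for $\theta$.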


9.3.2 Pivoting 


9.3.2.1 Accelerated Bias-Corrected Percentile Method, BC$_a$ The accelerated
bias-corrected percentile method, BC$_a$, usually offers substantial improvement
over the simple percentile approach [163, 178]. For the basic percentile method to
work well, it is necessary for the transformed estimator $\phi(\hat{\theta})$ to be unbiased with
variance that does not depend on $\theta$. BC$_a$ augments $\phi$ with two parameters to better
meet these conditions, thereby ensuring an approximate pivot.

Assume there exists a monotonically increasing function $\phi$ and constants $a$ and
$b$ such that

$$
U = \frac{\phi(\hat{\theta}) - \phi(\theta)}{1 + a\phi(\theta)} + b \tag{9.4}
$$

has a $N(0, 1)$ distribution, with $1 + a\phi(\theta) > 0$. Note that if $a = b = 0$, this transformation
leads us back to the simple percentile method.
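By the bootstrap principle,

$$
U^* = \frac{\phi(\hat{\theta}^*) - \phi(\hat{\theta})}{1 + a\phi(\hat{\theta})} + b \tag{9.5}
$$

has approximately a standard normal distribution. For any quantile of a standard
normal distribution, say $z_\alpha$,

$$
\alpha \approx P^*\left[U^* \le z_\alpha\right]
= P^*\left[\hat{\theta}^* \le \phi^{-1}\left(\phi(\hat{\theta}) + (z_\alpha - b)\left[1 + a\phi(\hat{\theta})\right]\right)\right]. \tag{9.6}
$$

However, the $\alpha$ quantile of the empirical distribution of $\hat{\theta}^*$, denoted $\xi_\alpha$, is observable
from the bootstrap distribution. Therefore

$$
\phi^{-1}\left(\phi(\hat{\theta}) + (z_\alpha - b)\left[1 + a\phi(\hat{\theta})\right]\right) \approx \xi_\alpha. \tag{9.7}
$$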
In order to use (9.7), consider $U$ itself:

$$
1 - \alpha = P\left[U > z_\alpha\right]
= P\left[\theta < \phi^{-1}\left(\phi(\hat{\theta}) + u(a, b, \alpha)\left[1 + a\phi(\hat{\theta})\right]\right)\right], \tag{9.8}
$$

where $u(a, b, \alpha) = (b - z_\alpha)/[1 - a(b - z_\alpha)]$. Notice the similarity between (9.6) and
(9.8). If we can find a $\beta$ such that $u(a, b, \alpha) = z_\beta - b$, then the bootstrap principle
can be applied to conclude that $\theta < \xi_\beta$ will approximate a $1 - \alpha$ upper confidence
limit. A straightforward inversion of this requirement yields

$$
\beta = \Phi\left(b + u(a, b, \alpha)\right) = \Phi\left(b + \frac{b + z_{1-\alpha}}{1 - a(b + z_{1-\alpha})}\right), \tag{9.9}
$$

where $\Phi$ is the standard normal cumulative distribution function and the last equality
follows from symmetry. Thus, if we knew a suitable $a$ and $b$, then to find a $1 - \alpha$
upper confidence limit we would first compute $\beta$ and then find the $\beta$th quantile of the
empirical distribution of $\hat{\theta}^*$, namely $\xi_\beta$, using the bootstrap pseudo-datasets.

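For a two-sided $1 - \alpha$ confidence interval, this approach yields
$P\left[\xi_{\beta_1} \le \theta \le \xi_{\beta_2}\right] \approx 1 - \alpha$, where

$$
\beta_1 = \Phi\left(b + \frac{b + z_{\alpha/2}}{1 - a(b + z_{\alpha/2})}\right), \tag{9.10}
$$

$$
\beta_2 = \Phi\left(b + \frac{b + z_{1-\alpha/2}}{1 - a(b + z_{1-\alpha/2})}\right), \tag{9.11}
$$

and $\xi_{\beta_1}$ and $\xi_{\beta_2}$ are the corresponding quantiles from the bootstrapped values of $\hat{\theta}^*$.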

As with the percentile method, the beauty of the above justification for BC$_a$ is
that explicit specification of the transformation $\phi$ is not necessary. Further, since the
BC$_a$ approach merely corrects the percentile levels determining the confidence interval
endpoints to be read from the bootstrap distribution, it shares the transformation-respecting
property of the simple percentile method.

The remaining question is the choice of $a$ and $b$. The simplest nonparametric
choices are $b = \Phi^{-1}\left(\hat{F}^*(\hat{\theta})\right)$ and

$$
a = \frac{1}{6} \sum_{i=1}^{n} \psi_i^3 \Bigg/ \left(\sum_{i=1}^{n} \psi_i^2\right)^{3/2}, \tag{9.12}
$$

where

$$
\psi_i = \hat{\theta}_{(\cdot)} - \hat{\theta}_{(-i)} \tag{9.13}
$$

with $\hat{\theta}_{(-i)}$ denoting the statistic computed omitting the $i$th observation, and $\hat{\theta}_{(\cdot)} =
(1/n) \sum_{i=1}^{n} \hat{\theta}_{(-i)}$. A related alternative is to let

$$
\psi_i = \lim_{\epsilon \to 0} \frac{T\left((1 - \epsilon)\hat{F} + \epsilon\delta_i\right) - T(\hat{F})}{\epsilon}, \tag{9.14}
$$

where $\delta_i$ represents the distribution function that steps from zero to one at the observation
$x_i$ (i.e., unit mass on $x_i$). The $\psi_i$ in (9.14) can be approximated using finite
differences. The motivation for these quantities and additional alternatives for $a$ and
$b$ are described by [589].
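The jackknife-based choice of $a$ in (9.12) and (9.13) and the simple choice of $b$ can be computed directly. The Python sketch below is illustrative only. The function names are hypothetical, and $b$ is computed on the assumption that the argument of $\Phi^{-1}$ is the proportion of bootstrap estimates that do not exceed $\hat{\theta}$, which is the usual convention for this constant; that proportion must lie strictly between 0 and 1 for the inverse normal distribution function to be defined.

```python
from statistics import NormalDist

def acceleration(data, statistic):
    """Acceleration constant a from leave-one-out values, as in (9.12)-(9.13)."""
    n = len(data)
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    psi = [center - t for t in leave_one_out]
    return sum(p ** 3 for p in psi) / (6 * sum(p ** 2 for p in psi) ** 1.5)

def bias_constant(theta_hat, theta_stars):
    """Constant b: inverse normal CDF of the share of bootstrap values <= theta_hat."""
    share = sum(t <= theta_hat for t in theta_stars) / len(theta_stars)
    return NormalDist().inv_cdf(share)  # raises an error if share is 0 or 1

def mean(values):
    return sum(values) / len(values)

a_skewed = acceleration([1, 2, 6], mean)     # right-skewed data give a > 0
a_symmetric = acceleration([1, 2, 3], mean)  # symmetric data give a = 0
b_centered = bias_constant(2, [1, 2, 3, 4])  # half the values are <= 2, so b = 0
print(a_skewed, a_symmetric, b_centered)
```

A value of $b$ near zero indicates that the bootstrap distribution is roughly median-centered on $\hat{\theta}$, and a value of $a$ near zero indicates little skewness in the leave-one-out estimates; when both vanish the method reduces to the simple percentile method.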


Example 9.6 (Copper–Nickel Alloy, Continued) Continuing the copper–nickel
alloy regression problem introduced in Example 9.3, we have $a = 0.0486$ [using
(9.13)] and $b = 0.00802$. The adjusted quantiles are therefore $\beta_1 = 0.038$ and $\beta_2 =
0.986$. The main effect of BC$_a$ was therefore to shift the confidence interval slightly
to the right. The resulting interval is (−0.203, −0.172).
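The adjusted percentile levels in this example follow from (9.10) and (9.11) by direct substitution. The short Python sketch below is an added check, with a hypothetical function name, that evaluates those two formulas using the standard library's normal distribution.

```python
from statistics import NormalDist

def bca_levels(a, b, alpha):
    """Adjusted percentile levels (beta_1, beta_2) from equations (9.10) and (9.11)."""
    z = NormalDist()  # standard normal

    def adjust(level):
        zq = z.inv_cdf(level)
        return z.cdf(b + (b + zq) / (1 - a * (b + zq)))

    return adjust(alpha / 2), adjust(1 - alpha / 2)

# Constants reported in Example 9.6.
beta1, beta2 = bca_levels(a=0.0486, b=0.00802, alpha=0.05)
print(round(beta1, 3), round(beta2, 3))

# With a = b = 0 the levels revert to alpha/2 and 1 - alpha/2.
plain = bca_levels(a=0.0, b=0.0, alpha=0.05)
```

With the reported $a$ and $b$, the levels come out at approximately 0.038 and 0.986 instead of 0.025 and 0.975, so both endpoints are read from further to the right in the bootstrap distribution.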


9.3.2.2 The Bootstrap t Another approximate pivot that is quite easy to implement
is provided by the bootstrap t method, also called the studentized bootstrap [176,
183]. Suppose θ = T(F) is to be estimated using θ̂ = T(F̂), with V(F̂) estimating the
variance of θ̂. Then it is reasonable to hope that R(𝒳, F) = [T(F̂) − T(F)]/√V(F̂)
will be roughly pivotal. Bootstrapping R(𝒳, F) yields a collection of R(𝒳*, F̂).

Denote by G and G* the distributions of R(𝒳, F) and R(𝒳*, F̂), respectively.
By definition, a 1 − α confidence interval for θ is obtained from the relation

    P\left[\xi_{\alpha/2}(G) \le R(\mathcal{X}, F) \le \xi_{1-\alpha/2}(G)\right]
      = P\left[\hat{\theta} - \sqrt{V(\hat{F})}\,\xi_{1-\alpha/2}(G) \le \theta \le \hat{\theta} - \sqrt{V(\hat{F})}\,\xi_{\alpha/2}(G)\right]
      = 1 - \alpha,

where ξ_α(G) is the α quantile of G. These quantiles are unknown because F (and
hence G) is unknown. However, the bootstrap principle implies that the distributions
G and G* should be roughly equal, so ξ_α(G) ≈ ξ_α(G*) for any α. Thus, a bootstrap
confidence interval can be constructed as

    \left( T(\hat{F}) - \sqrt{V(\hat{F})}\,\xi_{1-\alpha/2}(G^*),\; T(\hat{F}) - \sqrt{V(\hat{F})}\,\xi_{\alpha/2}(G^*) \right)    (9.15)

where the percentiles of G* are taken from the histogram of bootstrap values of
R(𝒳*, F̂). Since these are percentiles in the tail of the distribution, at least several
thousand bootstrap pseudo-datasets are needed for adequate precision.

FIGURE 9.2 Histogram of 10,000 values of R(𝒳*, F̂) from a studentized bootstrap analysis
with the copper-nickel alloy data.
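
As a rough illustration of the interval in (9.15), the sketch below applies the bootstrap t to a sample mean, using s²/n as the variance estimator V. The choice of statistic, the function names, and the simple order-statistic quantile rule are all assumptions made for this example.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def var_of_mean(xs):
    """Plug-in estimate V of the variance of the sample mean: s^2 / n."""
    m = mean(xs)
    n = len(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1) / n


def bootstrap_t_interval(data, alpha=0.05, B=5000, seed=0):
    """Studentized bootstrap interval for the mean, following (9.15)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = mean(data)
    se_hat = math.sqrt(var_of_mean(data))
    r_star = []
    for _ in range(B):
        xs = [rng.choice(data) for _ in range(n)]
        v_star = var_of_mean(xs)
        if v_star > 0:                        # skip resamples with zero variance
            r_star.append((mean(xs) - theta_hat) / math.sqrt(v_star))
    r_star.sort()
    m = len(r_star)
    q_lo = r_star[int(math.floor((alpha / 2) * (m - 1)))]
    q_hi = r_star[int(math.ceil((1 - alpha / 2) * (m - 1)))]
    # The upper quantile of G* sets the lower endpoint, and vice versa.
    return theta_hat - se_hat * q_hi, theta_hat - se_hat * q_lo
```

Note that the variance estimate is recomputed inside every pseudo-dataset; studentizing each bootstrap value with its own standard error is what distinguishes this method from the percentile approaches.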


Example 9.7 (Copper-Nickel Alloy, Continued) Continuing the copper-nickel
alloy regression problem introduced in Example 9.3, an estimator V(F̂) of the variance
of β̂_1/β̂_0 based on the delta method is

    \left(\frac{\hat{\beta}_1}{\hat{\beta}_0}\right)^2 \left( \frac{\widehat{\mathrm{var}}\{\hat{\beta}_1\}}{\hat{\beta}_1^2} + \frac{\widehat{\mathrm{var}}\{\hat{\beta}_0\}}{\hat{\beta}_0^2} - \frac{2\,\widehat{\mathrm{cov}}\{\hat{\beta}_0, \hat{\beta}_1\}}{\hat{\beta}_0 \hat{\beta}_1} \right)    (9.16)

where the estimated variances and covariance can be obtained from basic regression
results. Carrying out the bootstrap t method then yields the histogram shown in
Figure 9.2, which corresponds to G*. The 0.025 and 0.975 quantiles of G* are −5.77
and 4.44, respectively, and √V(F̂) = 0.00273. Thus, the 95% bootstrap t confidence
interval is (−0.197, −0.169).


This method requires an estimator of the variance of θ̂, namely V(F̂). If no such
estimator is readily available, a delta method approximation may be used [142].

The bootstrap t usually provides confidence interval coverage rates that closely
approximate the nominal confidence level. Confidence intervals from the bootstrap
t are most reliable when T(F̂) is approximately a location statistic in the sense that
a constant shift in all the data values will induce the same shift in T(F̂). They are
also more reliable for variance-stabilized estimators. Coverage rates for bootstrap
t intervals can be sensitive to the presence of outliers in the dataset and should be
used with caution in such cases. The bootstrap t does not share the
transformation-respecting property of the percentile-based methods above.


9.3.2.3 Empirical Variance Stabilization A variance-stabilizing transformation
is often the basis for a good pivot. A variance-stabilizing transformation of the
estimator θ̂ is one for which the sampling variance of the transformed estimator does
not depend on θ. Usually a variance-stabilizing transformation of the statistic to be
bootstrapped is unknown, but it can be estimated using the bootstrap.

Start by drawing B_1 bootstrap pseudo-datasets 𝒳*_j for j = 1, ..., B_1. Calculate
θ̂*_j for each bootstrap pseudo-dataset, and let F̂*_j be the empirical distribution function
of the jth bootstrap pseudo-dataset.

For each 𝒳*_j, next draw B_2 bootstrap pseudo-datasets 𝒳**_j1, ..., 𝒳**_jB_2 from F̂*_j.
For each j, let θ̂**_jk denote the parameter estimate from the kth subsample, and let θ̄**_j
be the mean of the θ̂**_jk. Then

    \hat{s}(\hat{\theta}^*_j) = \sqrt{ \frac{1}{B_2 - 1} \sum_{k=1}^{B_2} \left( \hat{\theta}^{**}_{jk} - \bar{\theta}^{**}_j \right)^2 }    (9.17)

is an estimate of the standard error of θ̂ given θ = θ̂*_j.


Fit a curve to the set of points {θ̂*_j, ŝ(θ̂*_j)}, j = 1, ..., B_1. For a flexible,
nonparametric fit, Chapter 11 reviews many suitable approaches. The fitted curve is an
estimate of the relationship between the standard error of the estimator and θ. We
seek a variance-stabilizing transformation to neutralize this relationship.

Recall that if Z is a random variable with mean θ and standard deviation s(θ),
then Taylor series expansion (i.e., the delta method) yields var{g(Z)} ≈ g′(θ)² s²(θ).
For the variance of g(Z) to be constant, we require

    g(z) = \int_a^z \frac{1}{s(u)} \, du    (9.18)

where a is any convenient constant for which 1/s(u) is continuous on [a, z]. Therefore,
an approximately variance-stabilizing transformation for θ̂ may be obtained from our
bootstrap data by applying (9.18) to the fitted curve from the previous step. The integral
can be approximated using a numerical integration technique from Chapter 5. Let ĝ(θ)
denote the result.

Now that an approximate variance-stabilizing transformation has been estimated,
the bootstrap t may be carried out on the transformed scale. Draw B_3 new
bootstrap pseudo-datasets from F̂, and apply the bootstrap t method to find an
interval for ĝ(θ). Note, however, that the standard error of ĝ(θ̂) is roughly constant, so we
can use R(𝒳*, F̂) = ĝ(θ̂*) − ĝ(θ̂) for computing the bootstrap t confidence interval.
Finally, the endpoints of the resulting interval can be converted back to the scale of θ
by applying the transformation ĝ⁻¹.

The strategy of drawing iterated bootstrap pseudo-datasets from each original
pseudo-dataset sample can be quite useful in a variety of settings. In fact, it is the
basis for the confidence interval approach described below.
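
The first stages of this procedure might be sketched as below. The sketch makes several simplifying assumptions that are not part of the text: a straight-line least-squares fit stands in for the flexible curve fit, the trapezoid rule stands in for the integration methods of Chapter 5, and all function names are hypothetical.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sd(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def se_points(data, stat, B1=100, B2=25, seed=0):
    """Pairs (theta*_j, s_hat(theta*_j)) from a two-level bootstrap."""
    rng = random.Random(seed)
    n = len(data)
    points = []
    for _ in range(B1):
        xj = [rng.choice(data) for _ in range(n)]                   # outer draw
        inner = [stat([rng.choice(xj) for _ in range(n)]) for _ in range(B2)]
        points.append((stat(xj), sd(inner)))
    return points


def fit_line(points):
    """Least-squares line s(u) = c0 + c1*u, floored so it stays positive."""
    us = [u for u, _ in points]
    ubar, sbar = mean(us), mean([s for _, s in points])
    sxx = sum((u - ubar) ** 2 for u in us)
    c1 = sum((u - ubar) * (s - sbar) for u, s in points) / sxx if sxx > 0 else 0.0
    c0 = sbar - c1 * ubar
    return lambda u: max(c0 + c1 * u, 1e-8)


def g_hat(z, s, a, steps=200):
    """Trapezoid-rule approximation to the integral in (9.18)."""
    h = (z - a) / steps
    total = 0.5 * (1.0 / s(a) + 1.0 / s(z))
    total += sum(1.0 / s(a + k * h) for k in range(1, steps))
    return total * h
```

With `s = fit_line(se_points(data, stat))`, the function `lambda z: g_hat(z, s, a)` plays the role of ĝ; inverting it on a grid would then map interval endpoints back to the original scale.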


9.3.2.4 Nested Bootstrap and Prepivoting Another style of pivoting is pro- 
vided by the nested bootstrap [26, 27]. This approach is sometimes also called the 
iterated or double bootstrap. 

Consider constructing a confidence interval or conducting a hypothesis test
based on a test statistic R_0(𝒳, F), given observed data values x_1, ..., x_n from the
model X_1, ..., X_n ~ i.i.d. F. Let F_0(q, F) = P[R_0(𝒳, F) ≤ q]. The notation for F_0
makes explicit the dependence of the distribution of R_0 on the distribution of the data
used in R_0. Then a two-sided confidence interval can be fashioned after the statement

    P\left[ F_0^{-1}(\alpha/2, F) \le R_0(\mathcal{X}, F) \le F_0^{-1}(1 - \alpha/2, F) \right] = 1 - \alpha    (9.19)

and a hypothesis test based on the statement

    P\left[ R_0(\mathcal{X}, F) \le F_0^{-1}(1 - \alpha, F) \right] = 1 - \alpha.    (9.20)

Of course, these probability statements depend on the quantiles of F_0, which
are not known. In the estimation case, F is not known; for hypothesis testing, the null
value for F is hypothesized. In both cases, the distribution of R_0 is not known. We
can use the bootstrap to approximate F_0 and its quantiles.

The bootstrap begins by drawing B bootstrap pseudo-datasets, 𝒳*_1, ..., 𝒳*_B,
from the empirical distribution F̂. For the jth bootstrap pseudo-dataset, compute the
statistic R_0(𝒳*_j, F̂). Let F̂_0(q, F̂) = (1/B) Σ_{j=1}^{B} 1{R_0(𝒳*_j, F̂) ≤ q}, where 1{A} = 1 if
A is true and zero otherwise. Thus F̂_0 estimates P*[R_0(𝒳*, F̂) ≤ q], which itself
estimates P[R_0(𝒳, F) ≤ q] = F_0(q, F) according to the bootstrap principle. Thus,
the upper limit of the confidence interval would be estimated as F̂_0⁻¹(1 − α/2, F̂), or
we would reject the null hypothesis if R_0({x_1, ..., x_n}, F) > F̂_0⁻¹(1 − α, F̂). This is
the ordinary nonparametric bootstrap.

Note, however, that a confidence interval constructed in this manner will not
have coverage probability exactly equal to 1 − α, because F̂_0 is only a bootstrap
approximation to the distribution of R_0(𝒳, F). Similarly, the size of the hypothesis
test is P[R_0(𝒳, F) > F̂_0⁻¹(1 − α, F̂)] ≠ α, since F_0(q, F) ≠ F̂_0(q, F̂).

Not knowing the distribution F_0 also deprives us of a perfect pivot: The random
variable R_1(𝒳, F) = F_0(R_0(𝒳, F), F) has a standard uniform distribution independent
of F. The bootstrap principle asserts the approximation of F_0 by F̂_0, and hence
the approximation of R_1(𝒳, F) by R̂_1(𝒳, F) = F̂_0(R_0(𝒳, F), F̂). This allows bootstrap
inference based on a comparison of R̂_1(𝒳, F) to the quantiles of a uniform
distribution. For hypothesis testing, this amounts to accepting or rejecting the null
hypothesis based on the bootstrap p-value.

However, we could instead proceed by acknowledging that R̂_1(𝒳, F) ~ F_1,
for some nonuniform distribution F_1. Let F_1(q, F) = P[R̂_1(𝒳, F) ≤ q]. Then the
correct size test rejects the null hypothesis if R̂_1 > F_1⁻¹(1 − α, F). A confidence
interval with the correct coverage probability is motivated by the statement
P[F_1⁻¹(α/2, F) ≤ R̂_1(𝒳, F) ≤ F_1⁻¹(1 − α/2, F)] = 1 − α. As before, F_1 is
unknown but may be approximated using the bootstrap. Now the randomness in R̂_1 comes
from two sources: (1) The observed data were random observations from F, and (2)
given the observed data (and hence F̂), R̂_1 is calculated from random resamplings
from F̂. To capture both sources of randomness, we use the following nested
bootstrapping algorithm:


1. Generate bootstrap pseudo-datasets 𝒳*_1, ..., 𝒳*_{B_0}, each as an i.i.d. random
sample from the original data with replacement.

2. Compute R_0(𝒳*_j, F̂) for j = 1, ..., B_0.

3. For j = 1, ..., B_0:

a. Let F̂_j denote the empirical distribution function of 𝒳*_j. Draw B_1 iterated
bootstrap pseudo-datasets, 𝒳**_j1, ..., 𝒳**_jB_1, each as an i.i.d. random sample
from F̂_j.

b. Compute the R_0(𝒳**_jk, F̂_j) for k = 1, ..., B_1.

c. Compute

    \hat{R}_1(\mathcal{X}^*_j, \hat{F}) = \hat{F}_0\big(R_0(\mathcal{X}^*_j, \hat{F}), \hat{F}_j\big)
      = \frac{1}{B_1} \sum_{k=1}^{B_1} 1\left\{ R_0(\mathcal{X}^{**}_{jk}, \hat{F}_j) \le R_0(\mathcal{X}^*_j, \hat{F}) \right\}.    (9.21)

4. Denote the empirical distribution function of the resulting sample of R̂_1(𝒳*_j, F̂)
as F̂_1.

5. Use R̂_1({x_1, ..., x_n}, F) and the quantiles of F̂_1 to construct the confidence
interval or hypothesis test.


Steps 1 and 2 capture the first source of randomness by applying the bootstrap principle
to approximate F by F̂. Step 3 captures the second source of randomness introduced
in R̂_1 when R_0 is bootstrapped conditional on F̂.
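
The nested algorithm above can be sketched for a simple case. The example below assumes that the parameter is a mean and that R_0 is the estimate minus the parameter, so that the bootstrap version is the pseudo-dataset mean minus the observed mean; the function names and the order-statistic quantile rule are hypothetical choices for this illustration.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def resample(xs, rng):
    return [rng.choice(xs) for _ in xs]


def nested_bootstrap(data, B0=300, B1=300, seed=0):
    """Return the B0 values of R0(X*_j, F_hat) and of R1_hat(X*_j, F_hat)."""
    rng = random.Random(seed)
    theta_hat = mean(data)
    r0, r1 = [], []
    for _ in range(B0):                       # steps 1 and 2
        xj = resample(data, rng)
        theta_j = mean(xj)
        r0_j = theta_j - theta_hat
        hits = 0
        for _ in range(B1):                   # steps 3a and 3b
            r0_jk = mean(resample(xj, rng)) - theta_j
            hits += r0_jk <= r0_j
        r0.append(r0_j)
        r1.append(hits / B1)                  # step 3c, equation (9.21)
    return r0, r1


def quantile(xs, q):
    ys = sorted(xs)
    return ys[min(len(ys) - 1, max(0, int(q * len(ys))))]


def nested_interval(data, alpha=0.05, **kwargs):
    """Steps 4 and 5: use quantiles of R1_hat as adjusted levels for R0."""
    r0, r1 = nested_bootstrap(data, **kwargs)
    lo_level = quantile(r1, alpha / 2)
    hi_level = quantile(r1, 1 - alpha / 2)
    theta_hat = mean(data)
    return theta_hat - quantile(r0, hi_level), theta_hat - quantile(r0, lo_level)
```

The total work is B_0 × B_1 inner resamples, which is why the double bootstrap is costly compared with single-level methods.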


Example 9.8 (Copper-Nickel Alloy, Continued) Returning to the regression
problem introduced in Example 9.3, let R_0({x_1, ..., x_13}, F) = β̂_1/β̂_0 − β_1/β_0.
Figure 9.3 shows a histogram of R̂_1 values obtained by the nested bootstrap with
B_0 = B_1 = 300. This distribution shows that F̂_1 differs noticeably from uniform.
Indeed, the nested bootstrap gave 0.025 and 0.975 quantiles of R̂_1 as 0.0316 and
0.990, respectively. The 3.16% and 99.0% percentiles of R_0(𝒳*, F̂) are then found
and used to construct a confidence interval for β_1/β_0, namely (−0.197, −0.168).


With its nested looping, the double bootstrap can be much slower than
other pivoting methods: In this case nine times more bootstrap draws were used
than for the preceding methods. There are reweighting methods such as bootstrap
recycling that allow reuse of the initial sample, thereby reducing the computational
burden [141, 484].

FIGURE 9.3 Histogram of 300 values of R̂_1(𝒳*, F̂) from a nested bootstrap analysis with
the copper-nickel alloy data.


9.3.3 Hypothesis Testing 


The preceding discussion about bootstrap construction of confidence intervals is
relevant for hypothesis testing, too. A hypothesized parameter value outside a
(1 − α)100% confidence interval can be rejected at significance level α. Hall and Wilson
offer some additional advice to improve the statistical power and accuracy of bootstrap
hypothesis tests [302].

First, bootstrap resampling should be done in a manner that reflects the null
hypothesis. To understand what this means, consider a null hypothesis about a
univariate parameter θ with null value θ_0. Let the test statistic be R(𝒳, F) = θ̂ − θ_0.
The null hypothesis would be rejected in favor of a simple two-sided alternative when
|θ̂ − θ_0| is large compared to a reference distribution. To generate the reference
distribution, it may be tempting to resample values R(𝒳*, F̂) = θ̂* − θ_0 via the bootstrap.
However, if the null is false, this statistic does not have the correct reference
distribution. If θ_0 is far from the true value of θ, then |θ̂ − θ_0| will not seem unusually
large compared to the bootstrap distribution of |θ̂* − θ_0|. A better approach is to use
values of R(𝒳*, F̂) = θ̂* − θ̂ to generate a bootstrap estimate of the null distribution
of R(𝒳, F). When θ_0 is far from the true value of θ, the bootstrap values of |θ̂* − θ̂|
will seem quite small compared to |θ̂ − θ_0|. Thus, comparing θ̂ − θ_0 to the bootstrap
distribution of θ̂* − θ̂ yields greater statistical power.

Second, we should reemphasize the importance of using a suitable pivot. It is
often best to base the hypothesis test on the bootstrap distribution of (θ̂* − θ̂)/σ̂*,
where σ̂* is the value of a good estimator of the standard deviation of θ̂* computed
from a bootstrap pseudo-dataset. This pivoting approach is usually superior to basing
the test on the bootstrap distribution of (θ̂* − θ̂)/σ̂, (θ̂* − θ_0)/σ̂, θ̂* − θ̂, or θ̂* − θ_0,
where σ̂ estimates the standard deviation of θ̂ from the original dataset.
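
Both pieces of advice can be combined in a short sketch. The example below assumes the parameter is a mean with the usual standard error estimate, and it reports the plain proportion of bootstrap values at least as extreme as the observed one as a two-sided p-value; these choices and the function names are assumptions made for illustration.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def se_mean(xs):
    m = mean(xs)
    n = len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1) / n)


def bootstrap_test(data, theta0, B=2000, seed=0):
    """Two-sided bootstrap p-value for H0: mean = theta0."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = mean(data)
    t_obs = abs(theta_hat - theta0) / se_mean(data)
    exceed = total = 0
    for _ in range(B):
        xs = [rng.choice(data) for _ in range(n)]
        se_star = se_mean(xs)
        if se_star == 0:                      # skip resamples with zero variance
            continue
        total += 1
        # Center at theta_hat (not theta0) and studentize with sigma_hat*.
        exceed += abs(mean(xs) - theta_hat) / se_star >= t_obs
    return exceed / total
```

Centering the bootstrap values at the observed estimate keeps the reference distribution close to the null distribution even when the hypothesized value is far from the truth.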


9.4 REDUCING MONTE CARLO ERROR 


9.4.1 Balanced Bootstrap 


Consider a bootstrap bias correction of the sample mean. The bias correction should
equal zero because X̄ is unbiased for the true mean μ. Now, R(𝒳, F) = X̄ − μ, and
the corresponding bootstrap values are R(𝒳*_j, F̂) = X̄*_j − X̄ for j = 1, ..., B. Even
though X̄ is unbiased, random selection of pseudo-datasets is unlikely to produce a
set of R(𝒳*, F̂) values whose mean is exactly zero. The ordinary bootstrap exhibits
unnecessary Monte Carlo variation in this case.

However, if each data value occurs in the combined collection of bootstrap
pseudo-datasets with the same relative frequency as it does in the observed data,
then the bootstrap bias estimate (1/B) Σ_{j=1}^{B} R(𝒳*_j, F̂) must equal zero. By balancing
the bootstrap data in this manner, a source of potential Monte Carlo error is
eliminated.

The simplest way to achieve this balance is to concatenate B copies of the
observed data values, randomly permute this series, and then read off B blocks of size n
sequentially. The jth block becomes 𝒳*_j. This is the balanced bootstrap, sometimes
called the permutation bootstrap [143]. More elaborate balancing algorithms have
been proposed [253], but other methods of reducing Monte Carlo error may be easier
or more effective [183].
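
The concatenate, permute, and split recipe just described can be written in a few lines. The function name and seeding convention below are illustrative assumptions.

```python
import random


def balanced_bootstrap(data, B, seed=0):
    """Return B pseudo-datasets in which every observation appears exactly B times overall."""
    rng = random.Random(seed)
    n = len(data)
    pool = list(data) * B                     # B copies of the observed data
    rng.shuffle(pool)                         # random permutation of the series
    return [pool[j * n:(j + 1) * n] for j in range(B)]
```

Because the pooled pseudo-datasets contain exactly the same values as B copies of the data, the average of the pseudo-dataset means equals the observed mean, so the bootstrap bias estimate for the sample mean is zero up to rounding.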


9.4.2 Antithetic Bootstrap 


For a sample of univariate data, x_1, ..., x_n, denote the ordered data as x_(1), ..., x_(n),
where x_(i) is the value of the ith order statistic (i.e., the ith smallest data value). Let
π(i) = n − i + 1 be a permutation operator that reverses the order statistics. Then
for each bootstrap dataset 𝒳* = {X*_1, ..., X*_n}, let 𝒳** = {X**_1, ..., X**_n} denote the
dataset obtained by substituting X_(π(i)) for every instance of X_(i) in 𝒳*. Thus, for
example, if 𝒳* has an unrepresentative predominance of the larger observed data
values, then the smaller observed data values will predominate in 𝒳**.

Using this strategy, each bootstrap draw provides two estimators: R(𝒳*, F̂) and
R(𝒳**, F̂). These two estimators will often be negatively correlated. For example,
if R is a statistic that is monotone in the sample mean, then negative correlation is
likely [409].

Let R_a(𝒳*, F̂) = ½(R(𝒳*, F̂) + R(𝒳**, F̂)). Then R_a has the desirable property
that it estimates the quantity of interest with variance

    \mathrm{var}\{R_a(\mathcal{X}^*, \hat{F})\} = \tfrac{1}{4}\Big( \mathrm{var}\{R(\mathcal{X}^*, \hat{F})\} + \mathrm{var}\{R(\mathcal{X}^{**}, \hat{F})\}
        + 2\,\mathrm{cov}\{R(\mathcal{X}^*, \hat{F}), R(\mathcal{X}^{**}, \hat{F})\} \Big)
      \le \mathrm{var}\{R(\mathcal{X}^*, \hat{F})\}    (9.22)

if the covariance is negative.
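
A minimal sketch of the antithetic pairing for univariate data follows. Resampling is done on the ranks so that each drawn order statistic can be swapped for its mirror image; the function names are hypothetical.

```python
import random


def antithetic_pair(data, rng):
    """Draw one pseudo-dataset and its antithetic partner."""
    xs = sorted(data)                         # order statistics x_(1), ..., x_(n)
    n = len(xs)
    idx = [rng.randrange(n) for _ in range(n)]
    x_star = [xs[i] for i in idx]
    x_anti = [xs[n - 1 - i] for i in idx]     # pi(i) = n - i + 1 in 1-based indexing
    return x_star, x_anti


def antithetic_estimates(data, stat, B, seed=0):
    """B values of R_a: the average of stat over each antithetic pair."""
    rng = random.Random(seed)
    values = []
    for _ in range(B):
        a, b = antithetic_pair(data, rng)
        values.append(0.5 * (stat(a) + stat(b)))
    return values
```

For perfectly symmetric data and the sample mean, the two members of each pair offset each other exactly, which is the extreme case of the negative correlation the method relies on.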
There are clever ways of establishing orderings of multivariate data, too, to
permit an antithetic bootstrap strategy [294].


9.5 BOOTSTRAPPING DEPENDENT DATA


A critical requirement for the validity of the above methods is that it must be
reasonable to assume that the bootstrapped quantities are i.i.d. With dependent data, these
approaches will produce a bootstrap distribution F̂* that does not mimic F because
it fails to capture the covariance structure inherent in F.

Assume that data x_1, ..., x_n comprise a partial realization from a stationary
time series of random variables X_1, ..., X_n, ... with the finite dimensional
joint distribution function of the random variables {X_1, ..., X_n} denoted F. For
a time series (X_1, ..., X_n, ...), stationarity implies that the joint distribution of
{X_t, X_{t+1}, ..., X_{t+k}} does not depend on t for any k ≥ 0. We also assume that
the process is weakly dependent in the sense that {X_t : t ≤ τ} is independent of
{X_t : t ≥ τ + k} in the limit as k → ∞ for any τ. Let 𝒳 = (X_1, ..., X_n) denote the
time series we wish to bootstrap, and hereafter we denote series with (·) and unordered
sets with {·}.

Since the elements of 𝒳 are dependent, it is inappropriate to apply the ordinary
bootstrap for i.i.d. data. This is obvious since F_{X_1,...,X_n} ≠ ∏_{i=1}^{n} F_{X_i} under
dependence. As a specific example, consider bootstrapping X̄ with mean μ. In the case
of dependent data, n var{X̄ − μ} equals var{X_1} plus many covariance terms.
However, n var*{X̄* − X̄} → var{X_1} as n → ∞, where var* represents the variance with
respect to the distribution F̂. Thus the covariance terms would be lost in the i.i.d.
bootstrap. Also see Example 9.9. Hence, applying the i.i.d. bootstrap to dependent
data cannot even ensure consistency [601].

Several bootstrap methods have been developed for dependent data. Bootstrap
theory and methods for dependent data are more complex than for the i.i.d. case, but
the heuristic of resampling the data to generate values of T(F̂*) for approximating the
sampling distribution of T(F̂) is the same. Comprehensive discussion of bootstrapping
methods for dependent data is given by [402]. A wide variety of methods have been
introduced by [81, 93, 94, 396, 425, 498, 512, 513, 529, 590, 591].


9.5.1 Model-Based Approach


Perhaps the simplest context for bootstrapping dependent data is when a time
series is known to be generated from a specific model such as the first-order
stationary autoregressive process, that is, the AR(1) model. This model is specified by
the relation

    X_t = \alpha X_{t-1} + \epsilon_t    (9.23)

where |α| < 1 and the ε_t are i.i.d. random variables with mean zero and constant
variance. If the data are known to follow or can be assumed to follow an AR(1) process,
then a method akin to bootstrapping the residuals for linear regression (Section 9.2.3)
can be applied.

Specifically, after using a standard method to estimate α (see, e.g., [129]), define
the estimated innovations to be ê_t = X_t − α̂X_{t−1} for t = 2, ..., n, and let ē be the
mean of these. The ê_t can be recentered to have mean zero by defining ε̂_t = ê_t − ē.
Bootstrap iterations should then resample n + 1 values from the set {ε̂_2, ..., ε̂_n} with
replacement with equal probabilities to yield a set of pseudo-innovations {ε*_0, ..., ε*_n}.
Given the model (and |α̂| < 1), a pseudo-data series can be reconstructed using X*_0 =
ε*_0 and X*_t = α̂X*_{t−1} + ε*_t for t = 1, ..., n.

When generated in this way, the pseudo-data series is not stationary. One remedy
is to sample a larger number of pseudo-innovations and to start generating the data
series "earlier," that is, from X*_k for k much less than 0. The first portion of the
generated series (t = k, ..., 0) can then be discarded as a burn-in period [402]. As
with any model-based bootstrap procedure, good performance for this approach is
dependent on the model being appropriate.
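
The residual-resampling scheme with a burn-in period might look like the following sketch. It assumes a mean-zero series and estimates α by least squares without an intercept, which is only one of several standard estimators; the function name and the default burn-in length are likewise assumptions.

```python
import random


def ar1_bootstrap(x, burn_in=100, seed=0):
    """Return (alpha_hat, pseudo_series) with len(pseudo_series) == len(x)."""
    rng = random.Random(seed)
    n = len(x)
    # Least-squares estimate of alpha (assumes the series has mean zero).
    num = sum(x[t] * x[t - 1] for t in range(1, n))
    den = sum(x[t - 1] ** 2 for t in range(1, n))
    alpha_hat = num / den
    # Estimated innovations, recentered to have mean zero.
    e = [x[t] - alpha_hat * x[t - 1] for t in range(1, n)]
    e_bar = sum(e) / len(e)
    eps = [v - e_bar for v in e]
    # Generate burn_in + n values and keep only the last n.
    value = rng.choice(eps)
    series = [value]
    for _ in range(burn_in + n - 1):
        value = alpha_hat * value + rng.choice(eps)
        series.append(value)
    return alpha_hat, series[-n:]
```

Calling the function repeatedly with different seeds yields the collection of pseudo-series from which bootstrap estimates are computed.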


9.5.2 Block Bootstrap 


Most often, a model-based approach should not be applied, so a more general method 
is needed. Many of the most common approaches to bootstrapping with dependent 
data rely on notions of blocking the data in order to preserve the covariance structure 
within each block even though that structure is lost between blocks once they are 
resampled. We begin by introducing the nonmoving and moving block bootstraps. 
It is important to note that our initial presentation of these methods omits several 
refinements like additional blocking, centering and studentizing that help ensure the 
best possible performance. We introduce those topics in Sections 9.5.2.3 and 9.5.2.4. 


9.5.2.1 Nonmoving Block Bootstrap Consider estimating an unknown quantity
θ = T(F) using the statistic θ̂ = T(F̂) where F̂ is the empirical distribution function
of the data. A bootstrap resampling approach will be used to estimate the sampling
distribution of θ̂ by obtaining a collection of bootstrap pseudo-estimates θ̂*_i
for i = 1, ..., m. Each θ̂*_i is computed as T(F̂*_i) where F̂*_i denotes the empirical
distribution function of a pseudo-dataset 𝒳*_i. These 𝒳*_i must be generated in a manner
that respects the correlation structure in the stochastic process that produced the
original data 𝒳. A simple approximate method that attempts to achieve this goal is
the nonmoving block bootstrap [93].

FIGURE 9.4 Time series of mean changes in gross domestic product (GDP) for 16
industrialized countries for 1871-1910. The horizontal line is the overall mean, X̄ = 2.6875.

Consider splitting 𝒳 = (X_1, ..., X_n) into b nonoverlapping blocks of length
l, where for simplicity hereafter we assume lb = n. Denote these blocks as ℬ_i =
(X_{(i−1)l+1}, ..., X_{il}) for i = 1, ..., b. The simplest nonmoving block bootstrap begins
by sampling ℬ*_1, ..., ℬ*_b independently from {ℬ_1, ..., ℬ_b} with replacement. These
blocks are then concatenated to form a pseudo-dataset 𝒳* = (ℬ*_1, ..., ℬ*_b). Replicating
this process B times yields a collection of bootstrap pseudo-datasets denoted 𝒳*_i
for i = 1, ..., B. Each bootstrap pseudo-value θ̂*_i is computed from a corresponding
𝒳*_i and the distribution of θ̂ is approximated by the distribution of these B
pseudo-values. Although this bootstrap procedure is simple, we will discuss shortly why it is
not the best way to proceed.

First, however, let us consider a simple example. Suppose n = 9, l = 3,
b = 3, and 𝒳 = (X_1, ..., X_9) = (1, 2, 3, 4, 5, 6, 7, 8, 9). The blocks would be ℬ_1 =
(1, 2, 3), ℬ_2 = (4, 5, 6), and ℬ_3 = (7, 8, 9). Independently sampling these blocks with
replacement and reassembling the result might yield 𝒳* = (4, 5, 6, 1, 2, 3, 7, 8, 9).
The order within blocks must be retained, but the order in which the blocks are
reassembled doesn't matter because 𝒳 is stationary. Another possible bootstrap sample
is 𝒳* = (1, 2, 3, 1, 2, 3, 4, 5, 6).
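
The nonmoving block bootstrap draw described above can be sketched as follows, under the same simplifying assumption that n = lb. The function name is hypothetical.

```python
import random


def nonmoving_block_bootstrap(x, l, rng):
    """One pseudo-series built from b = n / l disjoint blocks of length l."""
    n = len(x)
    if n % l != 0:
        raise ValueError("this sketch assumes n = l * b")
    b = n // l
    blocks = [x[i * l:(i + 1) * l] for i in range(b)]   # B_1, ..., B_b
    pseudo = []
    for _ in range(b):
        pseudo.extend(rng.choice(blocks))     # order within each block is kept
    return pseudo
```

With `x = list(range(1, 10))` and `l = 3`, every draw is a concatenation of three of the blocks (1, 2, 3), (4, 5, 6), and (7, 8, 9), as in the example above.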


Example 9.9 (Industrialized Countries GDP) The website for this book contains
data on the average percent change in gross domestic product (GDP) for 16
industrialized countries for the n = 40 years from 1871 to 1910, derived from [431]. The
data are shown in Figure 9.4.

Let θ̂ = X̄ estimate the mean GDP change over the period. The variance of this
estimator is

    \mathrm{var}\{\bar{X}\} = \frac{1}{n}\left( \mathrm{var}\{X_1\} + 2\sum_{i=1}^{n-1}\left(1 - \frac{i}{n}\right)\mathrm{cov}\{X_1, X_{1+i}\} \right).    (9.24)

Let b = 5 and l = 8. Figure 9.5 shows a histogram of B = 10,000 bootstrap
estimates X̄*_i for i = 1, ..., B using the nonmoving block bootstrap. The sample
standard deviation of these values is 0.196.

FIGURE 9.5 Histogram of B = 10,000 bootstrap estimates X̄* from Example 9.9.

Because most of the dominant covariance terms in Equation (9.24) are negative,
the sample standard deviation generated by the i.i.d. approach will be larger than the
one from the block bootstrap approach. In this example, the i.i.d. approach (which
corresponds to l = 1, b = 40) yields 0.3372.


9.5.2.2 Moving Block Bootstrap The nonmoving block bootstrap uses sequential
disjoint blocks that partition 𝒳. This choice is inferior to the more general strategy
employed by the moving block bootstrap [396]. With this approach, all blocks of l
adjacent X_t are considered, regardless of whether the blocks overlap. Thus we
define ℬ_i = (X_i, ..., X_{i+l−1}) for i = 1, ..., n − l + 1. Resample these blocks
independently with replacement, obtaining ℬ*_1, ..., ℬ*_b where again we make the convenient
assumption that n = lb. After arranging the ℬ*_i end to end in order to assemble 𝒳*, a
pseudo-estimate θ̂*_i = T(F̂*_i) is produced. Replicating this process B times provides
a bootstrap sample of θ̂*_i values for i = 1, ..., B. For the case where 𝒳 = (1, ..., 9),
a possible bootstrap series 𝒳* is (1, 2, 3, 2, 3, 4, 6, 7, 8), formed from the two
overlapping blocks (1, 2, 3) and (2, 3, 4) and the additional block (6, 7, 8).

Example 9.10 (Industrialized Countries GDP, Continued) For the previous
GDP dataset, the moving blocks bootstrap with l = 8 yields an estimated standard
deviation of 0.188. For comparison, the moving and nonmoving bootstrap applications
were replicated 20,000 times to assess the expected performance of the two
procedures. The medians (and standard deviations) of the bootstrap estimates of the
standard deviation were 0.187 (0.00125) and 0.196 (0.00131) for the moving
and nonmoving block approaches, respectively. In principle, the moving block bootstrap
should outperform its nonmoving block counterpart; see Section 9.6.2.
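
A sketch of the moving block bootstrap, together with a helper that estimates the standard deviation of the sample mean as in the GDP examples, is given below. The function names, the default number of replicates, and the requirement that n = lb are assumptions for this illustration; the GDP data themselves are not reproduced here.

```python
import math
import random


def moving_block_bootstrap(x, l, rng):
    """One pseudo-series built from b = n / l overlapping-eligible blocks."""
    n = len(x)
    if n % l != 0:
        raise ValueError("this sketch assumes n = l * b")
    pseudo = []
    for _ in range(n // l):
        start = rng.randrange(n - l + 1)      # any of the n - l + 1 block starts
        pseudo.extend(x[start:start + l])
    return pseudo


def block_bootstrap_sd_of_mean(x, l, B=1000, seed=0):
    """Sample standard deviation of B bootstrap means."""
    rng = random.Random(seed)
    means = []
    for _ in range(B):
        s = moving_block_bootstrap(x, l, rng)
        means.append(sum(s) / len(s))
    m = sum(means) / B
    return math.sqrt(sum((v - m) ** 2 for v in means) / (B - 1))
```

Setting `l = 1` reduces the procedure to the ordinary i.i.d. bootstrap, which makes it easy to compare the two standard deviation estimates on the same series.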


9.5.2.3 Blocks-of-Blocks Bootstrapping Above we have sidestepped a key
issue for the block bootstrap. Our example using X̄ is not sufficiently general
because the distribution of the sample mean depends only on the univariate marginal
distribution of X_t. For dependent data problems, many important parameters of
interest pertain to the covariance structure inherent in the joint distribution of
several X_t.

Notice that the serial correlation in 𝒳 will (usually) be broken in 𝒳* at each
point where adjacent resampled blocks meet as they are assembled to construct 𝒳*.
If the parameter θ = T(F) is related to a p-dimensional distribution, a naive moving
or nonmoving block bootstrap will not replicate the targeted covariance structure
because the pseudo-dataset will resemble white noise more than the original
series did.

For example, consider the lag 2 autocovariance ρ_2 = E{(X_t − EX_t)(X_{t+2} −
EX_t)}. This depends on the distribution function of the trivariate random variable
(X_t, X_{t+1}, X_{t+2}). An appropriate block bootstrapping technique would ensure that
each pseudo-estimate ρ̂*_2 is estimated only from such triples. This would eliminate
the instances in 𝒳* where X*_t and X*_{t+2} are not lag 2 adjacent to each other in the
original data. Without such a strategy, there would be as many as b − 1 inappropriate
contributions to ρ̂*_2.

The remedy is the blocks-of-blocks bootstrap. Let Y_j = (X_j, ..., X_{j+p−1}) for
j = 1, ..., n − p + 1. These Y_j now constitute a new series of p-dimensional random
variables to which a block bootstrap may be applied. Furthermore, the sequence
𝒴 = {Y_t} is stationary and we may now reexpress θ and θ̂ as T_Y(F_Y) and T_Y(F̂_Y),
respectively. Here F_Y is the distribution function of 𝒴 and T_Y is a reexpression of
T that enables the functional to be written in terms of F_Y so that the estimator is
calculated using 𝒴 rather than 𝒳.

For a nonmoving block bootstrap, then, 𝒴 = (Y_1, ..., Y_{n−p+1}) is partitioned
into b adjacent blocks of length l. Denote these blocks as ℬ_1, ..., ℬ_b. These blocks are
resampled with replacement, and appended end-to-end to form a pseudo-dataset 𝒴*.
Each 𝒴* yields a pseudo-estimate θ̂* = T_Y(F̂*_Y), where F̂*_Y is the empirical distribution
function of 𝒴*.

For example, let n = 13, b = 4, l = 3, p = 2, and 𝒳 = (1, 2, ..., 13). Then

    𝒴 = ( (1, 2), (2, 3), ..., (12, 13) ).

For the nonmoving blocks-of-blocks approach, the four nonoverlapping blocks of
blocks would be

    ℬ_1 = ( (1, 2), (2, 3), (3, 4) ),    ℬ_2 = ( (4, 5), (5, 6), (6, 7) ),
    ℬ_3 = ( (7, 8), (8, 9), (9, 10) ),   ℬ_4 = ( (10, 11), (11, 12), (12, 13) ).


One potential blocks-of-blocks nonmoving bootstrap dataset would be

    𝒴* = ( (7, 8), (8, 9), (9, 10), (1, 2), (2, 3), (3, 4),
           (1, 2), (2, 3), (3, 4), (10, 11), (11, 12), (12, 13) ).


The blocks-of-blocks approach for the moving block bootstrap proceeds
analogously. In this case, there are n − p + 1 blocks of size p. These blocks overlap,
so adjacent blocks look like (X_t, ..., X_{t+p−1}) and (X_{t+1}, ..., X_{t+p}). In the above
example, the first two of 10 blocks of blocks would be

    ℬ_1 = ( (1, 2), (2, 3), (3, 4) ),    ℬ_2 = ( (2, 3), (3, 4), (4, 5) ).


One potential pseudo-dataset would be

    𝒴* = ( (3, 4), (4, 5), (5, 6), (9, 10), (10, 11), (11, 12),
           (2, 3), (3, 4), (4, 5), (7, 8), (8, 9), (9, 10) ).


The blocks-of-blocks strategy is implicit in the rest of our block bootstrap
discussion. However, there are situations where vectorizing the data to work with the Y_t
or reexpressing T as T_Y is difficult or awkward. When these challenges become too
great an impediment, a pragmatic solution is to adopt the naive approach
corresponding to p = 1.
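
The vectorizing step and a moving blocks-of-blocks draw can be sketched as below. The function names are hypothetical, and the sketch assumes the number of vectors n − p + 1 is a multiple of l, as in the n = 13, p = 2, l = 3 example.

```python
import random


def vectorize(x, p):
    """Form the p-dimensional series Y_j = (X_j, ..., X_{j+p-1})."""
    return [tuple(x[j:j + p]) for j in range(len(x) - p + 1)]


def moving_blocks_of_blocks(x, p, l, rng):
    """One pseudo-dataset Y* assembled from overlapping blocks of l vectors."""
    y = vectorize(x, p)
    m = len(y)
    if m % l != 0:
        raise ValueError("this sketch assumes (n - p + 1) is a multiple of l")
    pseudo = []
    for _ in range(m // l):
        start = rng.randrange(m - l + 1)
        pseudo.extend(y[start:start + l])
    return pseudo
```

Because every element of the pseudo-dataset is an intact vector from the original series, a statistic computed from the first and last components of each vector only ever uses pairs that were truly p − 1 lags apart in the data.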


Example 9.11 (Tree Rings) The website for this book provides a dataset related
to tree rings for the long-lived bristlecone pine Pinus longaeva at Campito Mountain
in California. Raw basal area growth increments are shown in Figure 9.6 for one
particular tree with rings corresponding to the n = 452 years from 1532 to 1983. The
time series considered below has been detrended and standardized [277].

FIGURE 9.6 Raw bristlecone pine basal area growth increments for the years 1532-1983
discussed in Example 9.11.

Consider estimating the standard error of the lag 2 autocorrelation of the basal
area increments, that is, the correlation between X_t and X_{t+2}. The sample lag 2
autocorrelation is r̂ = 0.183. To apply the blocks-of-blocks method we must use
p = 3 so that each small block includes both X_t and X_{t+2} for t = 1, ..., 450.

Thus 𝒳 yields 450 triples Y_t = (X_t, X_{t+1}, X_{t+2}) and the vectorized series is
𝒴 = (Y_1, ..., Y_450). From these 450 blocks, we may resample blocks of blocks. Let
each of these blocks of blocks consist of 25 of the small blocks. The lag 2
correlation can be estimated as

    \hat{r} = \sum_{t=1}^{450} (Y_{t,1} - M)(Y_{t,3} - M) \Big/ \sum_{t=1}^{452} (X_t - M)^2

where Y_{t,j} is the jth element in Y_t and M is the mean of X_1, ..., X_n. The denominator
and M are expressed here in terms of 𝒳 for brevity, but they can be reexpressed in
terms of 𝒴 so that r̂ = T_Y(F̂_Y).

Applying the moving blocks-of-blocks bootstrap by resampling the Y_t and
assembling a pseudo-dataset 𝒴*_i yields a bootstrap estimate r̂*_i for each i = 1, ..., B.
The standard deviation of the resulting r̂*_i, that is, the estimated standard error of r̂,
is 0.51. A bootstrap bias estimate is −0.008 (see Section 9.2.4).


9.5.2.4 Centering and Studentizing The moving and nonmoving block
bootstrap methods yield different bootstrap distributions for θ̂*. To see this, consider when
θ = EX_t and θ̂ = X̄. For the nonmoving block bootstrap, assume that n = lb and note
that the blocks ℬ*_i are i.i.d., each with probability 1/b. Let E* represent expectation
with respect to the block bootstrap distribution. Then

    E^*(\hat{\theta}^*) = E^*\left( \frac{1}{l} \sum_{i=1}^{l} X_i^* \right)
      = \frac{1}{b} \sum_{i=1}^{b} \left( \frac{1}{l} \sum_{j=1}^{l} X_{(i-1)l+j} \right) = \bar{X}.    (9.25)

For the moving block bootstrap, in contrast,

    E^*(\hat{\theta}^*) = E^*\left( \frac{1}{l} \sum_{i=1}^{l} X_i^* \right)
      = \frac{1}{n-l+1} \left\{ n\bar{X} - \frac{1}{l} \sum_{i=1}^{l-1} (l-i)(X_i + X_{n-i+1}) \right\}.    (9.26)

The second term in the braces in (9.26) accounts for the fact that observations within l
positions of either end of the series occur in fewer blocks and hence contribute fewer
terms to the double sum above. In other words, the moving block bootstrap exhibits
edge effects. Note, however, that the mean squared difference between bootstrap
means is O(l/n²) so the difference vanishes as n → ∞.

There is an important implication of the fact that θ̂* is unbiased for the nonmoving
block bootstrap but biased for the moving block approach. Suppose we intend
to apply the moving block bootstrap to a pivoted quantity such as θ̂ − θ. One would
naturally consider the bootstrap version θ̂* − θ̂. However, E*{θ̂* − θ̂} ≠ 0, and this
error converges to zero at a slow rate that is unnecessary given the approach described
in the next paragraph.

The improvement is to center using θ̂* − E*θ̂*. For the sample mean, E*θ̂* is
given in (9.26). This alternative centering could present a significant new hurdle for
applying the moving blocks bootstrap to a more general statistic θ̂ = T(F̂) because
the calculation of E*θ̂* can be challenging. Fortunately, it can be shown that under
suitable conditions it suffices to apply the pivoting approach Θ(X̄*) − Θ(E*X̄*) when
bootstrapping any statistic that can be expressed as θ̂ = Θ(X̄) if Θ is a smooth function
[140, 275, 398]. This is called the smooth function model, which is a common context
in which to study and summarize asymptotic performance of block bootstrap methods.

Studentizing the statistic by scaling it with its estimated standard deviation
suffers from an analogous problem. Recognizing the smooth function result above, let us
make the simplifying assumption that θ̂ = X̄ and limit consideration to the nonmoving
bootstrap. A natural studentization would seem to be (X̄* − E*X̄*)/(s*/√n)
where s* is the standard deviation of the bootstrap data 𝒳*. However, s* is not a
good approximation to var*{X̄* − E*X̄*} [296, 312]. The improvements


b l l 
1 = 2 
B= Kits -Xu — E) (9.27) 
i=1 j=1 k=1 
and 
1 b-1 1 l 

2-5 do eK — XY Kine — X (9.28) 

i=0 j=l k=1 


are suggested alternatives [275, 312, 399]. Either is adequate. 

Another way to correct for edge effects is the circular block bootstrap [512].
This approach extends the observed time series by defining “new” observations
$X_{n+i}^{(\mathrm{new})} = X_i$ for $1 \le i \le l - 1$, which are concatenated to the end of the original
series. Then overlapping blocks are formed from the “wrapped” series in the same
manner as for the moving blocks bootstrap. These blocks are resampled independently
with replacement with equal probabilities. Since each $X_i$ ($1 \le i \le n$) in the original
$\mathcal{X}$ now occurs in exactly $l$ of the $n$ blocks in the extended collection, the edge effect is
eliminated.
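
The wrapping step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the series is assumed to be a plain list, and the pseudo-series is simply trimmed to length $n$ when the block length does not divide $n$. The point of the sketch is that every original observation is covered by the same number of blocks.

```python
import random

def circular_blocks(x, l):
    """Return the n overlapping blocks of length l from the wrapped series."""
    n = len(x)
    wrapped = list(x) + list(x[:l - 1])    # append the first l-1 values to the end
    return [wrapped[i:i + l] for i in range(n)]

def circular_block_bootstrap(x, l, rng):
    """Draw one pseudo-series of length n by resampling circular blocks."""
    n = len(x)
    blocks = circular_blocks(x, l)
    out = []
    while len(out) < n:
        out.extend(rng.choice(blocks))     # equal probabilities, with replacement
    return out[:n]                         # trim (an assumption of this sketch)

# Usage: one pseudo-series from a toy series of length 10 with blocks of length 3.
pseudo = circular_block_bootstrap([float(t) for t in range(10)], 3, random.Random(1))
```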

The stationary block bootstrap tackles the same edge effect issue by using
blocks of random lengths [513]. The block starting points are chosen i.i.d. over
$\{1, \ldots, n\}$. The ending points are drawn according to the geometric distribution given
by $P[\text{endpoint} = j] = p(1 - p)^{j-1}$. Thus block lengths are random with a conditional
mean of $1/p$. The choice of $p$ is a challenging question; however, simulations
show that stationary block bootstrap results are far less sensitive to the choice of $p$
than is the moving blocks bootstrap to the choice of $l$ [513]. Theoretically, it suffices
that $p \to 0$ and $np \to \infty$ as $n \to \infty$. From a practical point of view, $p = 1/l$ can be
recommended. The term stationary block bootstrap is used to describe this method
because it produces a stationary time series, whereas the moving and nonmoving
block bootstraps do not.
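
A rough Python sketch of this resampling scheme follows. The function name is hypothetical, and the sketch assumes that a block running past the end of the series wraps around to the beginning, a detail the description above does not spell out. Block lengths are generated so that $P[\text{length} = j] = p(1-p)^{j-1}$.

```python
import random

def stationary_bootstrap(x, p, rng):
    """Draw one pseudo-series of length n using blocks of geometric length."""
    n = len(x)
    out = []
    while len(out) < n:
        start = rng.randrange(n)           # block start, uniform over the series
        length = 1
        while rng.random() >= p:           # continue the block with probability 1 - p
            length += 1
        for k in range(length):
            out.append(x[(start + k) % n]) # wrap past the end (assumption)
    return out[:n]

# Usage: p = 1/l with l = 5 gives blocks with mean length 5.
pseudo = stationary_bootstrap(list(range(20)), 1 / 5, random.Random(2))
```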


9.5.2.5  Block Size  Performance of a block bootstrap technique depends on block
length, $l$. When $l = 1$, the method corresponds to the i.i.d. bootstrap and all correlation
structure is lost. For very large $l$, the autocorrelation is mostly retained but there will
be few blocks from which to sample. Asymptotic results indicate that, for the block
bootstrap, block length should increase as the length of the time series increases if
the method is to produce consistent estimators of moments, correct coverage probabilities
for confidence intervals, and appropriate error rates for hypothesis tests (see
Section 9.6.2). Several approaches for choosing block length in practice have been
suggested. We limit discussion here to two methods relevant for the moving block
bootstrap.

A reasonable basis for choosing block length is to consider the MSE of the
bootstrap estimator. In this chapter, we have considered $\theta = T(F)$ as an interesting
feature of a distribution $F$, and $\hat\theta$ as an estimator of this quantity. The statistic $\hat\theta$ will
have certain properties (features of its sampling distribution) that depend on the unknown
$F$, such as $\mathrm{var}\{\hat\theta\}$ or $\mathrm{MSE}\{\hat\theta\}$. The bootstrap is used to estimate such quantities.
Yet, the bootstrap estimator itself has its own bias, variance, and mean squared error
that again depend on $F$. These serve as criteria to evaluate the performance of the
block bootstrap and consequently to compare different choices for block length.

The MSE of a bootstrap estimator can be estimated by bootstrapping the bootstrap
estimator. Although neither of the methods discussed below implements a nesting
strategy as explicit as the one described in Section 9.3.2.4, they both adopt the
heuristic of multilevel resampling for estimation of the optimal block length, denoted
$l_{\mathrm{opt}}$. An alternative approach is explored by [83].


Subsampling Plus Bootstrapping  The approach described here is based on
an estimate of the mean squared error of a block bootstrap estimate when $\hat\theta$ is the
mean or a smooth function thereof [297]. Define $\phi_b = \mathrm{bias}\{\hat\theta\} = E\{\hat\theta - \theta\}$ and
$\phi_v = \mathrm{var}\{\hat\theta\} = E\{\hat\theta^2\} - (E\{\hat\theta\})^2$. Let $\hat\phi_b = \mathrm{bias}^*\{\hat\theta^*\}$ and $\hat\phi_v = \mathrm{var}^*\{\hat\theta^*\}$ be block bootstrap
estimates of $\phi_b$ and $\phi_v$. For example, under the smooth function model with $\mu$ denoting
the true mean and $\theta = H(\mu)$, $\hat\phi_b = \sum_{i=1}^{B} (H(\bar X_i^*) - H(\bar X))/B$ where $\bar X_i^*$ is the mean
of the $i$th pseudo-dataset and $H$ is the smooth function. Note that each $\hat\phi_j$ for $j \in \{b, v\}$
depends on $l$, so we may write these quantities as $\hat\phi_j(l)$. Under suitable conditions,
one can show that

$$\mathrm{var}\{\hat\phi_j(l)\} = \frac{c_1 l}{n^3} + o\!\left(\frac{l}{n^3}\right), \qquad (9.29)$$

$$\mathrm{bias}\{\hat\phi_j(l)\} = \frac{c_2}{nl} + o\!\left(\frac{1}{nl}\right), \qquad (9.30)$$

and therefore

$$\mathrm{MSE}\{\hat\phi_j(l)\} = \frac{c_1 l}{n^3} + \frac{c_2^2}{n^2 l^2} + o\!\left(\frac{l}{n^3} + \frac{1}{n^2 l^2}\right), \qquad (9.31)$$

for $j \in \{b, v\}$, although $c_1$ and $c_2$ depend on $j$. Differentiating this last expression and
solving for the $l$ that minimizes the MSE, we find

$$l_{\mathrm{opt}} \sim \left(\frac{2 c_2^2}{c_1}\right)^{1/3} n^{1/3}, \qquad (9.32)$$

where the symbol $\sim$ is defined by the relation $a_n \sim b_n$ if $\lim_{n \to \infty} a_n/b_n = 1$. For
simplicity in the rest of this section, let us focus on bias estimation, letting $\phi = \phi_b$.
We will note later that the same result holds for variance estimation.
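
The minimization that leads to (9.32) can be written out explicitly. The following sketch drops the $o(\cdot)$ remainder in (9.31), keeps only the two leading terms, and treats $l$ as a continuous variable; both simplifications are assumptions made for illustration.

$$\frac{d}{dl}\left(\frac{c_1 l}{n^3} + \frac{c_2^2}{n^2 l^2}\right) = \frac{c_1}{n^3} - \frac{2 c_2^2}{n^2 l^3} = 0 \quad\Longrightarrow\quad l^3 = \frac{2 c_2^2\, n}{c_1} \quad\Longrightarrow\quad l = \left(\frac{2 c_2^2}{c_1}\right)^{1/3} n^{1/3}.$$

The second derivative, $6 c_2^2/(n^2 l^4)$, is positive whenever $c_2 \neq 0$, so this stationary point is a minimum. The variance term grows with $l$ while the squared bias term shrinks, and the optimal block length balances the two.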

The goal is to derive the block length that minimizes $\mathrm{MSE}\{\hat\phi(l)\}$ with respect
to $l$. We will do this by estimating $\mathrm{MSE}\{\hat\phi(l)\}$ for several candidate values of $l$ and
selecting the best. Begin by choosing a pilot block size $l_0$ and performing the usual block
bootstrap to obtain $\hat\phi(l_0)$. Next, consider a smaller sub-dataset of size $m < n$ for which
we can obtain an analogous estimate, $\hat\phi_m(l')$ for some $l'$. The estimate of $\mathrm{MSE}\{\hat\phi(l')\}$
will depend on a collection of these $\hat\phi_m(l')$ and the original $\hat\phi(l_0)$.

Let $\mathcal{X}_i^{(m)} = (X_i, \ldots, X_{i+m-1})$ denote a subsequence of $\mathcal{X}$ of length $m$, for
$i = 1, \ldots, n - m + 1$. Applying the block bootstrap to $\mathcal{X}_i^{(m)}$ using $B$ iterations and
trial block length $l'$ produces a point estimate of $\phi$, denoted $\hat\phi_{i,m}(l')$, for each $i$. For
the bias example above, $\hat\phi_{i,m} = \sum_{j=1}^{B} (H(\bar X_{i,j}^*) - H(\bar X_i))/B$ where $\bar X_i$ is the mean of
$\mathcal{X}_i^{(m)}$ and $\bar X_{i,j}^*$ is the mean of the $j$th bootstrap pseudo-dataset generated from $\mathcal{X}_i^{(m)}$
for $j = 1, \ldots, B$. Then an estimate of the mean squared error of the block bootstrap
estimator $\hat\phi_m(l')$ based on the subset of size $m$ is

$$\widehat{\mathrm{MSE}}_m\{\hat\phi_m(l')\} = \frac{1}{n - m + 1} \sum_{i=1}^{n-m+1} \left(\hat\phi_{i,m}(l') - \hat\phi(l_0)\right)^2, \qquad (9.33)$$

recalling that $\hat\phi(l_0)$ is the estimate obtained by bootstrapping the full dataset using a
pilot block length $l_0$.


Let $\hat l_{\mathrm{opt}}^{(m)}$ minimize $\widehat{\mathrm{MSE}}_m\{\hat\phi_m(l')\}$ with respect to $l'$. This minimum may be found
by trying a sequence of $l'$ and selecting the best. Then $\hat l_{\mathrm{opt}}^{(m)}$ estimates the best block
size for a series of length $m$. Since the real data series is length $n$ and since optimal
block size is known to be of order $n^{1/3}$, we must scale up $\hat l_{\mathrm{opt}}^{(m)}$ accordingly to yield
$\hat l_{\mathrm{opt}} = \hat l_{\mathrm{opt}}^{(n)} = (n/m)^{1/3}\, \hat l_{\mathrm{opt}}^{(m)}$.

The procedure described here applies when $\phi$ is the bias or variance functional.
For estimating a distribution function, an analogous approach leads to an appropriate
scaling factor of $\hat l_{\mathrm{opt}} = (n/m)^{1/4}\, \hat l_{\mathrm{opt}}^{(m)}$.

Good choices for $m$ and $l_0$ are unclear. Choices like $m \approx 0.25n$ and $l_0 \approx 0.05n$
have produced reasonable simulation results in several examples [297, 402]. It is
important that the pilot value $l_0$ is plausible, but the effect of $l_0$ can potentially be
reduced through iteration. Specifically, after applying the procedure with an initial
pilot value $l_0$, the result may be iteratively refined by replacing the previous pilot
value with the current estimate $\hat l_{\mathrm{opt}}^{(n)}$ and repeating the process.
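
The steps above can be sketched in Python for the bias functional. This is an illustrative outline under stated assumptions, not the authors' code: the function names, the default smooth function $H(m) = m^2$, the use of a moving block bootstrap that trims each pseudo-series to the length of its source series, and the grid of candidate block lengths are all hypothetical choices. Candidate block lengths must not exceed $m$.

```python
import random
from statistics import fmean

def mbb_bias(x, l, B, rng, H=lambda m: m * m):
    """Moving block bootstrap estimate of bias{H(mean)} using block length l."""
    n = len(x)
    n_blocks = n - l + 1                     # number of overlapping blocks
    per_series = -(-n // l)                  # blocks per pseudo-series (ceiling)
    h_hat = H(fmean(x))
    total = 0.0
    for _ in range(B):
        series = []
        for _ in range(per_series):
            s = rng.randrange(n_blocks)
            series.extend(x[s:s + l])
        total += H(fmean(series[:n])) - h_hat
    return total / B

def choose_block_length(x, m, l0, candidates, B, rng):
    """Subsampling plus bootstrapping: pick l' minimizing the estimated MSE,
    then rescale from series length m to series length n."""
    n = len(x)
    phi_pilot = mbb_bias(x, l0, B, rng)      # pilot estimate from the full series
    best_l, best_mse = None, float("inf")
    for l in candidates:
        sq_err = [(mbb_bias(x[i:i + m], l, B, rng) - phi_pilot) ** 2
                  for i in range(n - m + 1)] # one term per subsequence of length m
        mse = fmean(sq_err)
        if mse < best_mse:
            best_l, best_mse = l, mse
    return (n / m) ** (1 / 3) * best_l       # scale up by (n/m)^(1/3)
```

The returned value is generally not an integer; rounding it, and optionally feeding it back in as a new pilot value, is left to the user.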


Jackknifing Plus Bootstrapping  An empirical plug-in approach has been
suggested as an alternative to the above method [404]. Here, the
jackknife-after-bootstrap approach [180, 401] is applied to estimate properties of the
bootstrap estimator.

Recall the expressions for $\mathrm{MSE}\{\hat\phi(l)\}$ and $l_{\mathrm{opt}}$ in Equations (9.31) and (9.32).
Equation (9.32) identifies the optimal rate at which block size should grow with
increasing sample size, namely proportionally to $n^{1/3}$. However, a concrete choice
for $l_{\mathrm{opt}}$ cannot be made without determination of $c_1$ and $c_2$.

Rearranging terms in Equations (9.29) and (9.30) yields

$$c_1 \sim n^3 l^{-1}\, \mathrm{var}\{\hat\phi(l)\}, \qquad (9.34)$$

$$c_2 \sim n l\, \mathrm{bias}\{\hat\phi(l)\}. \qquad (9.35)$$
Thus if $\mathrm{var}\{\hat\phi(l)\}$ and $\mathrm{bias}\{\hat\phi(l)\}$ can be approximated by convenient estimators $\hat V$ and
$\hat B$, then we can estimate $c_1$, $c_2$, and hence $\mathrm{MSE}\{\hat\phi(l)\}$. Moreover, Equation (9.32) can
be applied to estimate $l_{\mathrm{opt}}$.

The crux of this strategy is crafting the estimators $\hat B$ and $\hat V$. One can show that
the estimator

$$\hat B = 2\left(\hat\phi(l') - \hat\phi(2l')\right) \qquad (9.36)$$

is consistent for $\mathrm{bias}\{\hat\phi(l')\}$ under suitable conditions where $l'$ is a chosen block length.
The choice of $l'$ determines the accuracy of the estimator $\hat B$.

Calculation of $\hat V$ relies on a jackknife-after-bootstrap strategy [180]. Applied
within the blocks-of-blocks context, this approach deletes an adjacent set of blocks
and resamples the remainder. From this resample, $\hat\phi$ is calculated. Repeating this
process sequentially as the set of deleted blocks progresses from one end of $\mathcal{X}$ to the
other, one block at a time, yields the complete set of $\hat\phi$ bootstrap pseudo-values whose
variance may be calculated and scaled up to estimate $\mathrm{var}\{\hat\phi\}$.

The details are as follows. When the moving block bootstrap is applied to a
data sequence $(X_1, \ldots, X_n)$, there are $n - l + 1$ blocks $B_1, \ldots, B_{n-l+1}$ available for
resampling. The blocks are $B_j = (X_j, \ldots, X_{j+l-1})$ for $j = 1, \ldots, n - l + 1$. Suppose
that we delete $d$ adjacent blocks from this set of blocks. There are $n - l - d + 2$
possible ways to do this, deleting $(B_i, \ldots, B_{i+d-1})$ for $i = 1, \ldots, n - l - d + 2$.
The $i$th such deletion leads to the $i$th reduced dataset of blocks, called a block-deleted
dataset. By performing a moving block bootstrap with block length $l'$ on the $i$th
block-deleted dataset, the $i$th block-deleted value $\hat\phi_i$ can be computed via $\hat\phi_i = \phi(\hat F_i^*)$
where $\hat F_i^*$ is the empirical distribution function of the sample from this moving block
bootstrap of the $i$th block-deleted dataset.

However, the $n - l - d + 2$ separate block-deleted bootstraps considered above
can be carried out without explicitly conducting the block deletion steps. For each $i$ in
turn, the collection of original bootstrap pseudo-datasets can be searched to identify all
$\mathcal{X}^*$ in which none of the $i$th set of deleted blocks are present. Then this subcollection
of the original bootstrap pseudo-datasets can be used to calculate $\hat\phi_i$. An appropriate
variance estimator based on the block-deleted data is

$$\hat V = \frac{d}{n - l - d + 1} \sum_{i=1}^{n-l-d+2} \frac{(\tilde\phi_i - \hat\phi)^2}{n - l - d + 2}, \qquad (9.37)$$

where

$$\tilde\phi_i = \frac{(n - l + 1)\hat\phi - (n - l - d + 1)\hat\phi_i}{d} \qquad (9.38)$$


and $\hat\phi$ is the estimate of $\phi$ resulting from the original application of the bootstrap.
Finally, $l_{\mathrm{opt}}$ can be found using Equation (9.32). In this manner, the computational
effort associated with repeated resampling is replaced by increased coding complexity
needed to keep track of (or search for) the appropriate pseudo-datasets for each $i$.
Note that the choice of $d$ will strongly affect the performance of $\hat V$ as an estimator of
$\mathrm{var}\{\hat\phi(l_{\mathrm{opt}})\}$.
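
The search-based shortcut can be illustrated with a short Python sketch. Several things here are assumptions made for illustration: the function name is hypothetical, $\hat\phi$ is taken to be the moving block bootstrap bias estimate for the sample mean, a single block length $l$ is used throughout, and the pseudo-value and variance formulas follow the usual jackknife-after-bootstrap form, which should be checked against (9.37) and (9.38). If no stored pseudo-dataset avoids a given set of deleted blocks, that set is skipped, which is a simplification.

```python
import random
from statistics import fmean

def jab_variance(x, l, d, B, rng):
    """Jackknife-after-bootstrap estimate of var{phi_hat} without re-resampling."""
    n = len(x)
    N = n - l + 1                               # number of overlapping blocks
    b = n // l                                  # blocks per pseudo-series
    xbar = fmean(x)
    used, stats = [], []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(b)]
        series = [v for s in idx for v in x[s:s + l]]
        used.append(set(idx))                   # remember which blocks were drawn
        stats.append(fmean(series) - xbar)      # contribution to the bias estimate
    phi_hat = fmean(stats)
    pseudo = []
    for i in range(N - d + 1):                  # each run of d adjacent blocks
        deleted = set(range(i, i + d))
        kept = [t for blocks, t in zip(used, stats) if not (blocks & deleted)]
        if not kept:                            # no usable pseudo-dataset (skipped)
            continue
        phi_i = fmean(kept)                     # block-deleted value
        pseudo.append((N * phi_hat - (N - d) * phi_i) / d)
    mean_sq = sum((p - phi_hat) ** 2 for p in pseudo) / len(pseudo)
    return phi_hat, d / (N - d) * mean_sq

# Usage on a toy series; B must be large enough that some resamples avoid
# each set of deleted blocks.
toy_rng = random.Random(3)
toy = [toy_rng.gauss(0, 1) for _ in range(60)]
phi_hat, v_hat = jab_variance(toy, l=3, d=4, B=300, rng=toy_rng)
```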

Under suitable conditions, $\hat B$ is consistent for $\mathrm{bias}\{\hat\phi\}$ and $\hat V$ is consistent for
$\mathrm{var}\{\hat\phi\}$ when $d \to \infty$ and $d/n \to 0$ as $n \to \infty$ [404]. Yet a key part of this method
remains to be specified: the choices for $d$ and $l_0$. The values of $l_0 = n^{1/3}$ and
$d = n^{1/3} l^{2/3}$ are suggested on the basis of heuristic arguments and simulation [401, 403,
404]. An iterative strategy to refine $l_0$ is also possible.
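
Once $\hat B$ and $\hat V$ are available, the plug-in step is a short calculation. The sketch below assumes both estimates were computed at the same block length $l$ and that the constants are estimated by substituting $\hat V$ and $\hat B$ into the relations $c_1 \sim n^3 l^{-1} \mathrm{var}\{\hat\phi(l)\}$ and $c_2 \sim nl\,\mathrm{bias}\{\hat\phi(l)\}$; the function name is hypothetical.

```python
def plug_in_block_length(B_hat, V_hat, n, l):
    """Plug estimated bias and variance of phi_hat(l) into the optimal-length rule."""
    c1_hat = n ** 3 * V_hat / l       # from c1 ~ n^3 l^(-1) var{phi_hat(l)}
    c2_hat = n * l * B_hat            # from c2 ~ n l bias{phi_hat(l)}
    return (2.0 * c2_hat ** 2 / c1_hat) ** (1.0 / 3.0) * n ** (1.0 / 3.0)

# Usage with made-up inputs: B_hat and V_hat would come from the estimators above.
l_opt_hat = plug_in_block_length(B_hat=2e-4, V_hat=3e-8, n=1000, l=10)
```

Algebraically the result reduces to $l\,(2\hat B^2/\hat V)^{1/3}$, so the estimate rescales the working block length by a factor that depends only on the ratio of squared bias to variance.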

These results pertain to cases when estimating the best block length for bootstrap
estimation of bias or variance. Analogous arguments can be used to address
the situation when $\phi$ represents a quantile. In this case, assuming studentization,
$l_{\mathrm{opt}} \sim (2c_2^2/c_1)^{1/4} n^{1/4}$ and suggested starting values are $l_0 = n^{1/6}$ and $d = 0.1 n^{1/3} l^{2/3}$
[404].


9.6.1 Independent Data Case


All the bootstrap methods described in this chapter rely on the principle that the boot- 
strap distribution should approximate the true distribution for a quantity of interest. 
Standard parametric approaches such as a t-test and the comparison of a log likelihood
ratio to a $\chi^2$ distribution also rely on distributional approximation.

We have already discussed one situation where the i.i.d. bootstrap approxima- 
tion fails: for dependent data. The bootstrap also fails for estimation of extremes. 
For example, bootstrapping the sample maximum can be catastrophic; see [142] for 
details. Finally, the bootstrap can fail for heavy-tailed distributions. In these circum- 
stances, the bootstrap samples outliers too frequently. 

There is a substantial asymptotic theory for the consistency and rate of con- 
vergence of bootstrap methods, thereby formalizing the degree of approximation it 
provides. These results are mostly beyond the scope of this book, but we mention a 
few main ideas below. 

First, the i.i.d. bootstrap is consistent under suitable conditions [142]. Specifically,
consider a suitable space of distribution functions containing $F$, and let
$N_F$ denote a neighborhood of $F$ into which $\hat F$ eventually falls with probability 1.
If the distribution of a standardized $R(\mathcal{X}, G)$ is uniformly weakly convergent
when the elements of $\mathcal{X}$ are drawn from any $G \in N_F$, and if the mapping
from $G$ to the corresponding limiting distribution of $R$ is continuous, then
$P^*\big[\,|P^*[R(\mathcal{X}^*, \hat F) \le q] - P[R(\mathcal{X}, F) \le q]| > \epsilon\,\big] \to 0$ for any $\epsilon$ and any $q$ as $n \to \infty$.

Edgeworth expansions can be used to assess the rate of convergence [295].
Suppose that $R(\mathcal{X}, F)$ is standardized and asymptotically pivotal, when $R(\mathcal{X}, F)$ is
asymptotically normally distributed. Then the usual rate of convergence for the bootstrap
is given by $P^*[R(\mathcal{X}^*, \hat F) \le q] - P[R(\mathcal{X}, F) \le q] = O_p(n^{-1})$. Without pivoting,
the rate is typically only $O_p(n^{-1/2})$. In other words, coverage probabilities for
confidence intervals are $O(n^{-1/2})$ accurate for the basic, unpivoted percentile method,
but $O(n^{-1})$ accurate for BC$_a$ and the bootstrap $t$. The improvement offered by the
nested bootstrap depends on the accuracy of the original interval and the type of
interval. In general, nested bootstrapping can improve the rate of convergence of
coverage probabilities by an additional multiple of $n^{-1/2}$ or $n^{-1}$. Most common
inferential problems are covered by these convergence results, including estimation of
smooth functions of sample moments and solutions to smooth maximum likelihood
problems.

It is important to note that asymptotic approaches such as the normal approximation
via the central limit theorem are $O(n^{-1/2})$ accurate. This illustrates the
benefit of standardization when applying bootstrap methods because the convergence
rate for the bootstrap in that case is superior to what can be achieved using
ordinary asymptotic methods. Accessible discussions of the increases in convergence
rates provided by BC$_a$, the nested bootstrap, and other bootstrap improvements
are given in [142, 183]. More advanced theoretical discussion is also available
[47, 295, 589].


9.6.2 Dependent Data Case 


Under suitable conditions, the dependent data bootstrap methods discussed here are
also consistent. The convergence performance of these methods depends on whether
block length $l$ is the correct order (e.g., $l \propto n^{1/3}$ for bias and variance estimation). In
general, performance of block bootstrap methods when incorporating studentization
is superior to what is achieved by normal approximation via the central limit theorem,
but not as good as the performance of the bootstrap for i.i.d. data.

Not all dependent data bootstrap methods are equally effective. The moving
block bootstrap is superior to the nonmoving block approach in terms of mean squared
error. Suppose that bootstrapping is focused on estimating the bias or variance of an
underlying estimator. Then the asymptotic mean squared error (AMSE) for the nonmoving
blocks bootstrap is $1.5^{2/3} \approx 1.31$ times that of the moving blocks method, that is, about
31% larger, when the asymptotically optimal block sizes are used for each approach [297, 400].
The difference is attributable to the contribution of variances to AMSE; the bias terms
for the two methods are the same. Both AMSEs converge to zero at the same rate.

More sophisticated bootstrapping methods for dependent data can offer better 
asymptotic performance but are considerably more cumbersome and sometimes lim- 
ited to applications that are less general than those that can be addressed with one of the 
block methods described above. The tapered block bootstrap seeks to reduce the bias 
in variance estimation by down-weighting observations near the edges of blocks [498, 
499]. The sieve bootstrap aims to approximate the data generation process by initially 
fitting an autoregressive process. Recentered residuals are then resampled and used to
generate bootstrap datasets $\mathcal{X}^*$ from the fitted model via a recursion method for which
the impact of initializing the process is washed away as iterations increase [81, 82,
393]. The dependent wild bootstrap shares the superior asymptotic properties of the 
tapered block bootstrap and can be extended to irregularly spaced time series [590]. 
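
As a rough illustration of the sieve idea, the following Python sketch fits an AR(1) model by least squares, resamples recentered residuals, and regenerates a series by recursion with a burn-in period. The restriction to order 1, the least-squares fit, the burn-in length, and the function name are all simplifying assumptions of this sketch; in the sieve bootstrap proper the autoregressive order is chosen from the data and may grow with the series length.

```python
import random
from statistics import fmean

def sieve_bootstrap_ar1(x, rng, burn_in=100):
    """Generate one pseudo-series from a fitted AR(1) model with resampled residuals."""
    n = len(x)
    mu = fmean(x)
    z = [v - mu for v in x]                                  # centered series
    # Least-squares estimate of the AR(1) coefficient.
    coef = (sum(z[t] * z[t - 1] for t in range(1, n))
            / sum(v * v for v in z[:-1]))
    resid = [z[t] - coef * z[t - 1] for t in range(1, n)]
    rbar = fmean(resid)
    resid = [r - rbar for r in resid]                        # recenter residuals
    current, out = 0.0, []
    for t in range(burn_in + n):
        current = coef * current + rng.choice(resid)         # AR(1) recursion
        if t >= burn_in:                                     # discard the burn-in
            out.append(mu + current)
    return out

# Usage on a simulated AR(1) series of length 50.
demo_rng = random.Random(4)
series = [0.0]
for _ in range(49):
    series.append(0.6 * series[-1] + demo_rng.gauss(0, 1))
pseudo = sieve_bootstrap_ar1(series, demo_rng)
```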


9.7 OTHER USES OF THE BOOTSTRAP 


By viewing $\mathcal{X}^*$ as a random sample from a distribution $\hat F$ with known parameter $\hat\theta$,
the bootstrap principle can be seen as a tool used to approximate the likelihood function
itself. Bootstrap likelihood [141] is one such approach, which has connections
to empirical likelihood methods. By ascribing random weights to likelihood components,
a Bayesian bootstrap can be developed [558]. A generalization of this is the
weighted likelihood bootstrap, which is a powerful tool for approximating likelihood
surfaces in some difficult circumstances [485].

The bootstrap is generally used for assessing the statistical accuracy and precision
of an estimator. Bootstrap aggregating, or bagging, uses the bootstrap to improve
the estimator itself [63]. Suppose that the bootstrapped quantity, $R(\mathcal{X}, F)$, depends
on $F$ only through $\theta$. Thus, the bootstrap values of $R(\mathcal{X}, \theta)$ are $R(\mathcal{X}^*, \hat\theta)$. In some
cases, $\hat\theta$ is the result of a model-fitting exercise where the form of the model is uncertain
or unstable. For example, classification and regression trees, neural nets, and
linear regression subset selection are all based on models whose form may change
substantially with small changes to the data.

In these cases, a dominant source of variability in predictions or estimates
may be the model form. Bagging consists of replacing $\hat\theta$ with $\bar\theta^* = (1/B) \sum_{j=1}^{B} \hat\theta_j^*$,
where $\hat\theta_j^*$ is the parameter estimate arising from the $j$th bootstrap pseudo-dataset.
Since each bootstrap pseudo-dataset represents a perturbed version of the original
data, the models fit to each pseudo-dataset can vary substantially in form. Thus $\bar\theta^*$
provides a sort of model averaging that can reduce mean squared estimation error in
cases where perturbing the data can cause significant changes to $\hat\theta$. A review of the
model-averaging philosophy is provided in [331].
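
A minimal Python sketch of bagging for a scalar estimate is given below. The function names are hypothetical, the data are assumed to be i.i.d. so that the ordinary bootstrap applies, and the thresholded mean is an invented example of an estimator whose form changes abruptly under small perturbations of the data.

```python
import random
from statistics import fmean

def bagged_estimate(data, estimator, B, rng):
    """Average an estimator over B i.i.d. bootstrap pseudo-datasets."""
    n = len(data)
    estimates = []
    for _ in range(B):
        pseudo = [data[rng.randrange(n)] for _ in range(n)]  # resample cases
        estimates.append(estimator(pseudo))
    return fmean(estimates)

def thresholded_mean(sample, cutoff=0.5):
    """Unstable toy estimator: report the mean only if it clears a cutoff."""
    m = fmean(sample)
    return m if abs(m) > cutoff else 0.0

# Usage: the bagged version varies smoothly even though the base estimator jumps.
demo_rng = random.Random(5)
sample = [demo_rng.gauss(0.5, 1.0) for _ in range(30)]
theta_bar = bagged_estimate(sample, thresholded_mean, 200, demo_rng)
```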

A related strategy is the bootstrap umbrella of model parameters, or bumping 
approach [632]. For problems suitable for bagging, notice that the bagged average is 
not always an estimate from a model of the same class as those being fit to the data. 
For example, the average of classification trees is not a classification tree. Bumping 
avoids this problem. 

Suppose that $h(\theta, \mathcal{X})$ is some objective function relevant to estimation in the
sense that high values of $h$ correspond to $\theta$ that are very consistent with $\mathcal{X}$. For
example, $h$ could be the log likelihood function. The bumping strategy generates
bootstrap pseudo-values via $\hat\theta_j^* = \arg\max_\theta h(\theta, \mathcal{X}_j^*)$. The original dataset is included
among the bootstrap pseudo-datasets, and the final estimate of $\theta$ is taken to be the
$\hat\theta_j^*$ that maximizes $h(\theta, \mathcal{X})$ with respect to $\theta$. Thus, bumping is really a method for
searching through a space of models (or parameterizations thereof) for a model that
yields a good estimator.
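
The bumping search can be sketched generically in Python. Here `fit` stands for whatever procedure maximizes the objective on a given pseudo-dataset and `objective` stands for $h$; both names, and the toy least-squares example, are assumptions made for illustration.

```python
import random
from statistics import fmean

def bump(data, fit, objective, B, rng):
    """Fit on the original data and on B pseudo-datasets, then keep the fit
    that scores best on the ORIGINAL data."""
    n = len(data)
    candidates = [fit(data)]                       # original dataset is included
    for _ in range(B):
        pseudo = [data[rng.randrange(n)] for _ in range(n)]
        candidates.append(fit(pseudo))
    return max(candidates, key=lambda theta: objective(theta, data))

# Toy example: the fit is a sample mean and h is the negative sum of squares.
def neg_sum_squares(theta, sample):
    return -sum((v - theta) ** 2 for v in sample)

demo_rng = random.Random(6)
sample = [demo_rng.gauss(0, 1) for _ in range(25)]
theta_bumped = bump(sample, fmean, neg_sum_squares, 50, demo_rng)
```

In this toy case the original-data fit already maximizes $h$, so bumping returns it; the method is of interest when the fitting procedure is greedy or unstable and may miss a better model that a perturbed dataset happens to reveal.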


9.8 PERMUTATION TESTS 


There are other important techniques aside from the bootstrap that share the underlying 
strategy of basing inference on “experiments” within the observed dataset. Perhaps 
the most important of these is the classic permutation test that dates back to the era of 
Fisher [194] and Pitman [509, 510]. Comprehensive introductions to this field include 
[173, 271, 439]. The basic approach is most easily explained through a hypothetical 
example. 


Example 9.12 (Comparison of Independent Group Means) Consider a medical
experiment where rats are randomly assigned to treatment and control groups. The
outcome $X_i$ is then measured for the $i$th rat. Under the null hypothesis, the outcome
does not depend on whether a rat was labeled as treatment or control. Under the
alternative hypothesis, outcomes tend to be larger for rats labeled as treatment.

A test statistic $T$ measures the difference in outcomes observed for the two
groups. For example, $T$ might be the difference between group mean outcomes,
having value $t_1$ for the observed dataset.

Under the null hypothesis, the individual labels “treatment” and “control” are 
meaningless because they have no influence on the outcome. Since they are mean- 
ingless, the labels could be randomly shuffled among rats without changing the joint 
null distribution of the data. Shuffling the labels creates a new dataset: Although one 
instance of each original outcome is still seen, the outcomes appear to have arisen 
from a different assignment of treatment and control. Each of these permuted datasets 
is as likely to have been observed as the actual dataset, since the experiment relied on 
random assignment. 

Let $t_2$ be the value of the test statistic computed from the dataset with this
first permutation of labels. Suppose all $M$ possible permutations (or a large number
of randomly chosen permutations) of the labels are examined, thereby obtaining
$t_2, \ldots, t_M$.

Under the null hypothesis, $t_2, \ldots, t_M$ were generated from the same distribution
that yielded $t_1$. Therefore, $t_1$ can be compared to the empirical quantiles of $t_1, \ldots, t_M$
to test a hypothesis or construct confidence limits.


To pose this strategy more formally, suppose that we observe a value $t$ for a
test statistic $T$ having density $f$ under the null hypothesis. Suppose large values of
$T$ indicate that the null hypothesis is false. Monte Carlo hypothesis testing proceeds
by generating a random sample of $M - 1$ values of $T$ drawn from $f$. If the observed
value $t$ is the $k$th largest among all $M$ values, then the null hypothesis is rejected at
a significance level of $k/M$. If the distribution of the test statistic is highly discrete,
then ties found when ranking $t$ can be dealt with naturally by reporting a range of
p-values. Barnard [22] posed the approach in this manner; interesting extensions are
offered in [38, 39].
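
For the two-group comparison of Example 9.12, a Monte Carlo permutation test can be sketched as follows. The function name is hypothetical, the statistic is assumed to be the difference in group means with large values favoring the alternative, and ties are counted against the observed value, which gives the upper end of the range of p-values mentioned above.

```python
import random
from statistics import fmean

def permutation_p_value(treatment, control, M, rng):
    """One-sided Monte Carlo permutation p-value for a difference in means."""
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    t_obs = fmean(treatment) - fmean(control)
    k = 1                                   # the observed statistic counts once
    for _ in range(M - 1):
        rng.shuffle(pooled)                 # randomly reassign the labels
        t_perm = fmean(pooled[:n_t]) - fmean(pooled[n_t:])
        if t_perm >= t_obs:                 # ties counted against the observed value
            k += 1
    return k / M                            # observed value is kth largest of M

# Usage with made-up outcomes for five treated and five control rats.
p = permutation_p_value([10, 11, 12, 13, 14], [0, 1, 2, 3, 4], 200, random.Random(7))
```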

There are a variety of approaches for sampling from the null distribution of the 
test statistic. The permutation approach described in Example 9.12 works because 
“treatment” and “control” are meaningless labels assigned completely at random and 
independent of outcome, under the null hypothesis. This simple permutation approach 
can be broadened for application to a variety of more complicated situations. In all 
cases, the permutation test relies heavily on the condition of exchangeability. The 
data are exchangeable if the probability of any particular joint outcome is the same 
regardless of the order in which the observations are considered. 

There are two advantages to the permutation test over the bootstrap. First, if
the basis for permuting the data is random assignment, then the resulting p-value
is exact (if all possible permutations are considered). For such experiments, the
approach is usually called a randomization test. In contrast, standard parametric
approaches and the bootstrap are founded on asymptotic theory that is relevant for large
sample sizes. Second, permutation tests are often more powerful than their bootstrap
counterparts. However, the permutation test is a specialized tool for making
a comparison between distributions, whereas a bootstrap tests hypotheses about
parameters, thereby requiring less stringent assumptions and providing greater
flexibility. The bootstrap can also provide a reliable confidence interval and standard
error, beyond the mere p-value given by the permutation test. The standard deviation
observed in the permutation distribution is not a reliable standard error estimate.
Additional guidance on choosing between a permutation test and a bootstrap is offered in
[183, 271, 272].


TABLE 9.3  Forty years of fishery data: numbers of recruits (R) and spawners (S).

  R    S      R    S      R    S      R    S
 68   56    222  351    311  412    244  265
 77   62    205  282    166  176    222  301
299  445    233  310    248  313    195  234
220  279    228  266    161  162    203  229
142  138    188  256    226  368    210  270
287  428    132  144     67   54    275  478
276  319    285  447    201  214    286  419
115  102    188  186    267  429    275  490
 64   51    224  389    121  115    304  430
206  289    121  113    301  407    214  235


PROBLEMS 


9.1. Let $X_1, \ldots, X_n \sim$ i.i.d. Bernoulli($\theta$). Define $R(\mathcal{X}, F) = \bar X - \theta$ and $R^* = R(\mathcal{X}^*, \hat F)$,
where $\mathcal{X}^*$ is a bootstrap pseudo-dataset and $\hat F$ is the empirical distribution of the data.
Derive the exact $E^*\{R^*\}$ and $\mathrm{var}^*\{R^*\}$ analytically.


9.2. Suppose $\theta = g(\mu)$, where $g$ is a smooth function and $\mu$ is the mean of the distribution
from which the data arise. Consider bootstrapping $R(\mathcal{X}, F) = g(\bar X) - g(\mu)$.


a. Show that $E^*\{\bar X^*\} = \bar x$ and $\mathrm{var}^*\{\bar X^*\} = \hat\mu_2/n$, where $\hat\mu_k = n^{-1} \sum_{i=1}^{n} (x_i - \bar x)^k$.
b. Use Taylor series to show that 

NX a W(X on 
BO, Wes 


ERA", P) = = ae 


and 


var*{R(X*, F)} = 


Lh L'E fa a 
H2 +e. 
n n 


4n? 


9.3. Justify the choice of $b$ for BC$_a$ given in Section 9.3.2.1.


9.4. Table 9.3 contains 40 annual counts of the numbers of recruits and spawners in a salmon
population. The units are thousands of fish. Recruits are fish that enter the catchable
population. Spawners are fish that are laying eggs. Spawners die after laying eggs.

The classic Beverton-Holt model for the relationship between spawners and
recruits is

$$R = \frac{1}{\beta_1 + \beta_2/S}, \qquad \beta_1 \ge 0 \text{ and } \beta_2 \ge 0,$$

where $R$ and $S$ are the numbers of recruits and spawners, respectively [46]. This model
may be fit using linear regression with the transformed variables $1/R$ and $1/S$.

Consider the problem of maintaining a sustainable fishery. The total population
abundance will only stabilize if $R = S$. The total population will decline if fewer recruits
are produced than the number of spawners who died producing them. If too many
recruits are produced, the population will also decline eventually because there is not
enough food for them all. Thus, only some middle level of recruits can be sustained
indefinitely in a stable population. This stable population level is the point where the
45° line intersects the curve relating $R$ and $S$.


a. Fit the Beverton-Holt model, and find a point estimate for the stable population
level where $R = S$. Use the bootstrap to obtain a corresponding 95% confidence
interval and a standard error for your estimate, from two methods: bootstrapping
the residuals and bootstrapping the cases. Histogram each bootstrap distribution,
and comment on the differences in your results.


b. Provide a bias-corrected estimate and a corresponding standard error for the
corrected estimator.


c. Use the nested bootstrap with prepivoting to find a 95% confidence interval for the
stabilization point.


TABLE 9.4  Survival Times (Days) for Patients with Two Types of Terminal Cancer

Stomach    25    42    45    46    51   103   124   146   340   396   412   876  1112
Breast     24    40   719   727   791  1166  1235  1581  1804  3460  3808


9.5. Patients with advanced terminal cancer of the stomach and breast were treated with
ascorbate in an attempt to prolong survival [87]. Table 9.4 shows survival times (days).
Work with the data on the log scale.


a. Use the bootstrap $t$ and BC$_a$ methods to construct 95% confidence intervals for the
mean survival time of each group.


b. Use a permutation test to examine the hypothesis that there is no difference in mean 
survival times between groups. 


c. Having computed a reliable confidence interval in (a), let us explore some possible
missteps. Construct a 95% confidence interval for the mean breast cancer survival
time by applying the simple bootstrap to the logged data and exponentiating the
resulting interval boundaries. Construct another such confidence interval by applying
the simple bootstrap to the data on the original scale. Compare with (a).


9.6. The National Earthquake Information Center has provided data on the number of
earthquakes per year exceeding magnitude 7.0 for the years from 1900 to 1998 [341]. These
data are available from the website for this book. Difference the data so that the number
for each year represents the change since the previous year.


a. Determine a suitable block length for bootstrapping in this problem.


b. Estimate the 90th percentile of the annual change. Estimate the standard error of
this estimate using the moving block bootstrap.


c. Apply the model-based approach of Section 9.5.1, assuming an AR(1) model, to
estimate the standard error from part (b).


d. Estimate the lag-1 autocorrelation of the annual change. Find the bootstrap bias and
standard error of this estimate using the moving block bootstrap with an appropriate
blocks-of-blocks strategy.


9.7. Use the problem of estimating the mean of a standard Cauchy distribution to illustrate
how the bootstrap can fail for heavy-tailed distributions. Use the problem of estimating
$\theta$ for the Unif(0, $\theta$) distribution to illustrate how the bootstrap can fail for extremes.


9.8. Perform a simulation experiment on an artificial problem of your design, to compare
the accuracy of coverage probabilities and the widths of 95% bootstrap confidence
intervals constructed using the percentile method, the BC$_a$ method, and the bootstrap
$t$. Discuss your findings.


9.9. Conduct an experiment in the same spirit as the previous question to study block
bootstrapping for dependent data, investigating the following topics.


a. Compare the performance of the moving and nonmoving block bootstraps.


b. Compare the performance of the moving block bootstrap for different block lengths
$l$, including one choice estimated to be optimal.


c. Compare the performance of the moving block bootstrap with and without
studentization.


CHAPTER 10


NONPARAMETRIC DENSITY 
ESTIMATION 


This chapter concerns estimation of a density function f using observations of random 
variables X1, ..., Xn sampled independently from f. Initially, this chapter focuses on 
univariate density estimation. Section 10.4 introduces some methods for estimating 
a multivariate density function. 

In exploratory data analysis, an estimate of the density function can be used to 
assess multimodality, skew, tail behavior, and so forth. For inference, density estimates 
are useful for decision making, classification, and summarizing Bayesian posteriors. 
Density estimation is also a useful presentational tool since it provides a simple, 
attractive summary of a distribution. Finally, density estimation can serve as a tool 
in other computational methods, including some simulation algorithms and Markov 
chain Monte Carlo approaches. Comprehensive monographs on density estimation 
include [581, 598, 651]. 

The parametric solution to a density estimation problem begins by assuming a
parametric model, $X_1, \ldots, X_n \sim$ i.i.d. $f_{X|\theta}$, where $\theta$ is a very low-dimensional parameter
vector. Parameter estimates $\hat\theta$ are found using some estimation paradigm, such
as maximum likelihood, Bayesian, or method-of-moments estimation. The resulting
density estimate at $x$ is $f_{X|\theta}(x|\hat\theta)$. The danger with this approach lies at the start:
Relying on an incorrect model $f_{X|\theta}$ can lead to serious inferential errors, regardless
of the estimation strategy used to generate $\hat\theta$ from the chosen model.

In this chapter, we focus on nonparametric approaches to density estimation 
that assume very little about the form of f. These approaches use predominantly 
local information to estimate f at a point x. More precise viewpoints on what makes 
an estimator nonparametric are offered in [581, 628]. 

One familiar nonparametric density estimator is a histogram, which is a piece- 
wise constant density estimator. Histograms are produced automatically by most soft- 
ware packages and are used so routinely that one rarely considers their underlying 
complexity. Optimal choice of the locations, widths, and number of bins is based on 
sophisticated theoretical analysis. 

Another elementary density estimator can be motivated by considering how
density functions assign probability to intervals. If we observe a data point $X_i = x_i$,
we assume that $f$ assigns some density not only at $x_i$ but also in a region around $x_i$, if
$f$ is smooth enough. Therefore, to estimate $f$ from $X_1, \ldots, X_n \sim \text{i.i.d. } f$, it makes
sense to accumulate localized probability density contributions in regions around
each $X_i$.

Specifically, to estimate the density at a point $x$, suppose we consider a region
of width $d_x = 2h$, centered at $x$, where $h$ is some fixed number. Then the proportion
of the observations that fall in the interval $\gamma = [x - h, x + h]$ gives an indication
of the density at $x$. More precisely, we may take
$\hat f(x) = \frac{1}{n\,d_x} \sum_{i=1}^n 1_{\{|x - X_i| < d_x/2\}}$, that is,

$$\hat f(x) = \frac{1}{2hn} \sum_{i=1}^n 1_{\{|x - X_i| < h\}}, \tag{10.1}$$

where $1_{\{A\}} = 1$ if $A$ is true, and 0 otherwise.

Let $N_\gamma(h, n) = \sum_{i=1}^n 1_{\{|x - X_i| < h\}}$ denote the number of sample points falling in
the interval $\gamma$. Then $N_\gamma$ is a $\text{Bin}(n, p(\gamma))$ random variable, where
$p(\gamma) = \int_{x-h}^{x+h} f(t)\,dt$. Thus $E\{N_\gamma/n\} = p(\gamma)$ and
$\text{var}\{N_\gamma/n\} = p(\gamma)(1 - p(\gamma))/n$. Clearly $nh$ must increase as $N_\gamma$ increases in order
for (10.1) to provide a reasonable estimator, yet we can be more precise about
separate requirements for $n$ and $h$. The proportion of points falling in the interval
$\gamma$ estimates the probability assigned to $\gamma$ by $f$. In order to approximate the density
at $x$, we must shrink $\gamma$ by letting $h \to 0$. Then
$\lim_{h \to 0} E\{\hat f(x)\} = \lim_{h \to 0} [p(\gamma)/(2h)] = f(x)$. Simultaneously, however, we want
to increase the total sample size since $\text{var}\{\hat f(x)\} \to 0$ as $n \to \infty$. Thus, a
fundamental requirement for the pointwise consistency of the estimator $\hat f$ in (10.1)
is that $nh \to \infty$ and $h \to 0$ as $n \to \infty$. We will see later that these requirements hold
in far greater generality.
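
The estimator in (10.1) is simple enough to sketch in a few lines. The following Python fragment is only an illustration, not code from the text: the function name `naive_density`, the simulated standard normal sample, and the choice h = 0.2 are assumptions made for the example.

```python
import random

def naive_density(x, data, h):
    """Estimator (10.1): proportion of points within h of x, divided by 2h."""
    count = sum(1 for xi in data if abs(x - xi) < h)
    return count / (2.0 * h * len(data))

rng = random.Random(1)                      # fixed seed for reproducibility
data = [rng.gauss(0.0, 1.0) for _ in range(5000)]
estimate = naive_density(0.0, data, h=0.2)
print(estimate)   # the true N(0, 1) density at 0 is about 0.399
```

With n fixed, shrinking h reduces the number of points that fall in the interval and makes the estimate noisier, which is the practical face of the requirement that nh grow while h shrinks.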


10.1 MEASURES OF PERFORMANCE 


To better understand what makes a density estimator perform well, we must first
consider how to assess the quality of a density estimator. Let $\hat f$ denote an estimator
of $f$ that is based on some fixed number $h$ that controls how localized the probability
density contributions used to construct $\hat f$ should be. A small $h$ will indicate that $\hat f(x)$
should depend more heavily on data points observed near $x$, whereas a larger $h$ will
indicate that distant data should be weighted nearly equally to observations near $x$.

To evaluate $\hat f$ as an estimator of $f$ over the entire range of support, one could
use the integrated squared error (ISE),

$$\text{ISE}(h) = \int_{-\infty}^{\infty} \left( \hat f(x) - f(x) \right)^2 dx. \tag{10.2}$$

Note that ISE($h$) is a function of the observed data, through its dependence on $\hat f(x)$.
Thus it summarizes the performance of $\hat f$ conditional on the observed sample. If we
want to discuss the generic properties of an estimator without reference to a particular
observed sample, it seems more sensible to further average ISE($h$) over all samples
that might be observed. The mean integrated squared error (MISE) is

$$\text{MISE}(h) = E\{\text{ISE}(h)\}, \tag{10.3}$$

where the expectation is taken with respect to the distribution $f$. Thus MISE($h$) may
be viewed as the average value of a global measure of error [namely ISE($h$)] with
respect to the sampling density. Moreover, with an interchange of expectation and
integration,

$$\text{MISE}(h) = \int \text{MSE}_h(\hat f(x))\, dx, \tag{10.4}$$

where

$$\text{MSE}_h(\hat f(x)) = E\left\{ \left( \hat f(x) - f(x) \right)^2 \right\} = \text{var}\{\hat f(x)\} + \left( \text{bias}\{\hat f(x)\} \right)^2 \tag{10.5}$$

and $\text{bias}\{\hat f(x)\} = E\{\hat f(x)\} - f(x)$. Equation (10.4) suggests that MISE($h$) can also
be viewed as accumulating the local mean squared error at every $x$.

For multivariate density estimation, ISE($h$) and MISE($h$) are defined analogously.
Specifically, $\text{ISE}(h) = \int [\hat f(\mathbf{x}) - f(\mathbf{x})]^2\, d\mathbf{x}$ and $\text{MISE}(h) = E\{\text{ISE}(h)\}$.
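
As a numerical illustration of (10.2) and (10.3), the sketch below approximates ISE(h) by the trapezoidal rule and MISE(h) by averaging ISE(h) over simulated samples. It is a rough illustration under stated assumptions, not a procedure from the text: the true density is taken to be standard normal, the estimator is the simple one in (10.1), the integral is truncated to [-5, 5], and all function names and tuning constants are hypothetical.

```python
import math
import random

def naive_density(x, data, h):
    """Estimator (10.1) with half-width h."""
    return sum(1 for xi in data if abs(x - xi) < h) / (2.0 * h * len(data))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def ise(data, h, lo=-5.0, hi=5.0, m=200):
    """Trapezoidal approximation to ISE(h) in (10.2), truncated to [lo, hi]."""
    step = (hi - lo) / m
    total = 0.0
    for k in range(m + 1):
        x = lo + k * step
        weight = 0.5 if k in (0, m) else 1.0
        total += weight * (naive_density(x, data, h) - std_normal_pdf(x)) ** 2
    return total * step

def mise(h, n=100, reps=10, seed=0):
    """Monte Carlo approximation to MISE(h) in (10.3): average ISE over samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += ise(sample, h)
    return total / reps

results = {h: mise(h) for h in (0.05, 0.5, 3.0)}
for h, value in results.items():
    print(h, value)
```

In this setup the intermediate value h = 0.5 should give a much smaller approximate MISE than either extreme, since a very small h inflates variance and a very large h inflates bias.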

MISE($h$) and ISE($h$) both measure the quality of the estimator $\hat f$, and each can
be used to develop criteria for selecting a good value for $h$. Preference between these
two measures is a topic of some debate [284, 299, 357]. The distinction is essentially
one between the statistical concepts of loss and risk. Using ISE($h$) is conceptually
appealing because it assesses the estimator's performance with the observed data.
However, focusing on MISE($h$) is an effective way to approximate ISE-based
evaluation while reflecting the sensible goal of seeking optimal performance on average
over many data sets. We will encounter both measures in the following sections.

Although we limit attention to performance criteria based on squared error for
the sake of simplicity and familiarity, squared error is not the only reasonable option.
For example, there are several potentially appealing reasons to replace integrated
squared error with the $L_1$ norm $\int |\hat f(x) - f(x)|\, dx$, and MISE($h$) with the
corresponding expectation. Notably among these, the $L_1$ norm is unchanged under any
monotone continuous change of scale. This dimensionless character of $L_1$ makes it a
sort of universal measure of how near $\hat f$ is to $f$. Devroye and Györfi study the theory
of density estimation using $L_1$ and present other advantages of this approach [159,
160]. In principle, the optimality of an estimator depends on the metric by which
performance is assessed, so the adoption of different metrics may favor different types
of estimators. In practice, however, many other factors generally affect the quality of
a density estimator more than the metric that might have been used to motivate it.


10.2 KERNEL DENSITY ESTIMATION


The density estimator given in Equation (10.1) weights all points within $h$ of $x$ equally.
A univariate kernel density estimator allows a more flexible weighting scheme, fitting

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^n K\left( \frac{x - X_i}{h} \right), \tag{10.6}$$


where $K$ is a kernel function and $h$ is a fixed number usually termed the bandwidth.


FIGURE 10.1 Normal kernel density estimate (solid) and kernel contributions (dotted) for
the sample $x_1, \ldots, x_4$. The kernel density estimate at any $x$ is the sum of the kernel
contributions centered at each $x_i$.


A kernel function assigns weights to the contributions given by each $X_i$ to the
kernel density estimator $\hat f(x)$, depending on the proximity of $X_i$ to $x$. Typically, kernel
functions are positive everywhere and symmetric about zero. $K$ is usually a density,
such as a normal or Student's $t$ density. Other popular choices include the triweight
and Epanechnikov kernels (see Section 10.2.2), which don't correspond to familiar
densities. Note that the univariate uniform kernel, namely $K(z) = \frac{1}{2} 1_{\{|z| < 1\}}$, yields
the estimator given by (10.1). Constraining $K$ so that $\int z^2 K(z)\, dz = 1$ allows $h$ to
play the role of the scale parameter of the density $K$, but is not required.

Figure 10.1 illustrates how a kernel density estimate is constructed from a 
sample of four univariate observations, x1,..., x4. Centered at each observed data 
point is a scaled kernel: in this case, a normal density function divided by 4. These 
contributions are shown with the dotted lines. Summing the contributions yields the 
estimate $\hat f$ shown with the solid line.

The estimator in (10.6) is more precisely termed a fixed-bandwidth kernel density
estimator because $h$ is constant. The value chosen for the bandwidth exerts a
strong influence on the estimator $\hat f$. If $h$ is too small, the density estimator will tend
to assign probability density too locally near observed data, resulting in a very wiggly
estimated density function with many false modes. When $h$ is too large, the density
estimator will spread probability density contributions too diffusely. Averaging over
neighborhoods that are too large smooths away important features of $f$.
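
A minimal sketch of the fixed-bandwidth estimator (10.6) with a normal kernel follows. The four data values, the bandwidth h = 1, and the function names are hypothetical choices for illustration; the final lines check numerically that the estimate integrates to approximately 1, as it must when K is a density.

```python
import math

def normal_kernel(z):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(x, data, h, kernel=normal_kernel):
    """Fixed-bandwidth kernel density estimate (10.6) at a single point x."""
    return sum(kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [1.0, 2.5, 3.0, 6.0]                    # hypothetical sample of four points
step = 0.05
grid = [i * step for i in range(-100, 301)]    # evenly spaced from -5 to 15
dens = [kde(x, data, h=1.0) for x in grid]

# Trapezoidal approximation to the integral of the estimate over the grid.
area = sum(0.5 * (dens[i] + dens[i + 1]) * step for i in range(len(grid) - 1))
print(round(area, 4))   # should be very close to 1
```

Each observation contributes one scaled kernel, exactly as in Figure 10.1; rerunning with a smaller or larger h shows the wiggly and oversmoothed extremes described above.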

FIGURE 10.2 Histogram of 100 data points drawn from the bimodal distribution in
Example 10.1, and three normal kernel density estimates. The estimates correspond to
bandwidths h = 1.875 (dashed line), h = 0.625 (heavy), and h = 0.3 (solid).


Notice that computing a kernel density estimate at every observed sample point
based on a sample of size $n$ requires $n(n - 1)$ evaluations of $K$. Thus, the
computational burden of calculating $\hat f$ grows quickly with $n$. However, for most practical
purposes such as graphing the density, the estimate need not be computed at each
$X_i$. A practical strategy is to calculate $\hat f(x)$ over a grid of values for $x$, then linearly
interpolate between grid points. A grid of a few hundred values is usually sufficient
to provide a graph of $\hat f$ that appears smooth. An even faster, approximate method of
calculating the kernel density estimate relies on binning the data and rounding each
value to the nearest bin center [315]. Then, the kernel need only be evaluated at each
nonempty bin center, with density contributions weighted by bin counts. Drastic
reductions in computing time can thereby be obtained in situations where $n$ is so large
as to prevent calculating individual contributions to $\hat f$ centered at every $X_i$.
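
The binning idea can be sketched as follows. This is a simplified illustration rather than the specific algorithm of [315]: observations are rounded to the nearest multiple of a hypothetical bin width, and the kernel is then evaluated once per nonempty bin center with the bin count as a weight. The sample, bandwidth, and bin width are assumptions for the example.

```python
import math
import random
from collections import Counter

def normal_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde_exact(x, data, h):
    """Direct evaluation of (10.6): one kernel evaluation per observation."""
    return sum(normal_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def bin_data(data, width):
    """Round each observation to the nearest bin center (a multiple of width)."""
    return Counter(round(xi / width) * width for xi in data)

def kde_binned(x, bin_counts, n, h):
    """Approximate (10.6): one kernel evaluation per nonempty bin center."""
    total = sum(c * normal_kernel((x - center) / h)
                for center, c in bin_counts.items())
    return total / (n * h)

rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]
h = 0.4
counts = bin_data(data, width=0.05)

points = [-2.0 + 0.1 * k for k in range(41)]
max_diff = max(abs(kde_exact(x, data, h) - kde_binned(x, counts, len(data), h))
               for x in points)
print(len(counts), max_diff)
```

Here the number of kernel evaluations per point drops from n to the number of nonempty bins, at the cost of a small approximation error that depends on the bin width relative to h.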


10.2.1 Choice of Bandwidth 


The bandwidth parameter controls the smoothness of the density estimate. Recall
from (10.4) and (10.5) that MISE($h$) equals the integrated mean squared error. This
emphasizes that the bandwidth determines the trade-off between the bias and variance
of $\hat f$. Such a trade-off is a pervasive theme in nearly all kinds of model selection,
including regression, density estimation, and smoothing (see Chapters 11 and 12).
A small bandwidth produces a density estimator with wiggles indicative of high
variability caused by undersmoothing. A large bandwidth causes important features
of $f$ to be smoothed away, thereby causing bias.


Example 10.1 (Bimodal Density) The effect of bandwidth is shown in Figure 10.2.
This histogram shows a sample of 100 points from an equally weighted mixture of
$N(4, 1^2)$ and $N(9, 2^2)$ densities. Three density estimates that use a standard normal
kernel are superimposed, with h = 1.875 (dashed), h = 0.625 (heavy), and h = 0.3
(solid). The bandwidth h = 1.875 is clearly too large because it leads to an
oversmooth density estimate that fails to reveal the bimodality of $f$. On the other hand,
h = 0.3 is too small a bandwidth, leading to undersmoothing. The density estimate is
too wiggly, exhibiting many false modes. The bandwidth h = 0.625 is adequate,
correctly representing the main features of $f$ while suppressing most effects of
sampling variability.
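
The qualitative effect in Example 10.1 can be explored with a simulated sample. The sketch below draws its own 100 points from the same mixture (it does not reproduce the data behind Figure 10.2) and counts local maxima of the normal kernel estimate on a grid for the three bandwidths. The seed, grid limits, and the helper `count_modes` are hypothetical, and counting grid maxima is only a crude diagnostic.

```python
import math
import random

def normal_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kde(x, data, h):
    return sum(normal_kernel((x - xi) / h) for xi in data) / (len(data) * h)

def count_modes(data, h, lo=-2.0, hi=18.0, m=400):
    """Count strict local maxima of the estimate on an evenly spaced grid."""
    step = (hi - lo) / m
    dens = [kde(lo + k * step, data, h) for k in range(m + 1)]
    return sum(1 for k in range(1, m) if dens[k - 1] < dens[k] > dens[k + 1])

rng = random.Random(10)
# Equally weighted mixture of N(4, 1^2) and N(9, 2^2).
data = [rng.gauss(4.0, 1.0) if rng.random() < 0.5 else rng.gauss(9.0, 2.0)
        for _ in range(100)]

modes = {h: count_modes(data, h) for h in (1.875, 0.625, 0.3)}
print(modes)
```

The number of modes found should not increase as h grows; the exact counts depend on the simulated sample.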


In the following subsections we discuss a variety of ways to choose h. When 
density estimation is used mainly for exploratory data analysis, span choice based on 
visual inspection is defensible, and the trial-and-error process leading to your choice 
might itself provide useful insight into the stability of features observed in the density 
estimate. In practice, one may simply try a sequence of values for h, and choose a 
value that is large enough to avoid the threshold where smaller bandwidths cause 
features of the density estimate to become unstable or the density estimate exhibits 
obvious wiggles so localized that they are unlikely to represent modes of f. Although 
the density estimate is sensitive to the choice of bandwidth, we stress that there is no 
single correct choice in any application. Indeed, bandwidths within 10-20% of each 
other will often produce qualitatively similar results. 

There are situations when a more formal bandwidth selection procedure might 
be desired: for use in automatic algorithms, for use by a novice data analyst, or for 
use when a greater degree of objectivity or formalism is desired. A comprehensive 
review of approaches is given in [360]; other good reviews include [32, 88, 359, 500, 
581, 592, 598]. 

To understand bandwidth selection, it is necessary to further analyze MISE($h$).
Suppose that $K$ is a symmetric, continuous probability density function with mean
zero and variance $0 < \sigma_K^2 < \infty$. Let $R(g)$ denote a measure of the roughness of a
given function $g$, defined by

$$R(g) = \int g^2(z)\, dz. \tag{10.7}$$

Assume hereafter that $R(K) < \infty$ and that $f$ is sufficiently smooth. In this section,
this means that $f$ must have two bounded continuous derivatives and $R(f'') < \infty$;
higher-order smooth derivatives are required for some methods discussed later. Recall
that

$$\text{MISE}(h) = \int \text{MSE}_h(\hat f(x))\, dx = \int \text{var}\{\hat f(x)\} + \left( \text{bias}\{\hat f(x)\} \right)^2 dx. \tag{10.8}$$


We further analyze this expression, allowing $h \to 0$ and $nh \to \infty$ as $n \to \infty$.
To compute the bias term in (10.8), note that

$$E\{\hat f(x)\} = \frac{1}{h} \int K\left( \frac{x - u}{h} \right) f(u)\, du = \int K(t) f(x - ht)\, dt \tag{10.9}$$

by applying a change of variable. Next, substituting the Taylor series expansion

$$f(x - ht) = f(x) - ht f'(x) + \tfrac{1}{2} h^2 t^2 f''(x) + o(h^2) \tag{10.10}$$

in (10.9) and noting that $K$ is symmetric about zero leads to

$$E\{\hat f(x)\} = f(x) + \tfrac{1}{2} h^2 \sigma_K^2 f''(x) + o(h^2), \tag{10.11}$$

where $o(h^2)$ is a quantity that converges to zero faster than $h^2$ does as $h \to 0$. Thus,

$$\left( \text{bias}\{\hat f(x)\} \right)^2 = \tfrac{1}{4} h^4 \sigma_K^4 [f''(x)]^2 + o(h^4), \tag{10.12}$$

and integrating the square of this quantity over $x$ gives

$$\int \left( \text{bias}\{\hat f(x)\} \right)^2 dx = \tfrac{1}{4} h^4 \sigma_K^4 R(f'') + o(h^4). \tag{10.13}$$


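To compute the variance term in (10.8), a similar strategy is employed:

$$\begin{aligned}
\text{var}\{\hat f(x)\} &= \frac{1}{n}\, \text{var}\left\{ \frac{1}{h} K\left( \frac{x - X_i}{h} \right) \right\} \\
&= \frac{1}{nh} \int K(t)^2 f(x - ht)\, dt - \frac{1}{n} \left[ E\left\{ \frac{1}{h} K\left( \frac{x - X_i}{h} \right) \right\} \right]^2 \\
&= \frac{1}{nh} \int K(t)^2 [f(x) + o(1)]\, dt - \frac{1}{n} [f(x) + o(1)]^2 \\
&= \frac{1}{nh} f(x) R(K) + o\left( \frac{1}{nh} \right).
\end{aligned} \tag{10.14}$$

Integrating this quantity over $x$ gives

$$\int \text{var}\{\hat f(x)\}\, dx = \frac{R(K)}{nh} + o\left( \frac{1}{nh} \right). \tag{10.15}$$

Thus,

$$\text{MISE}(h) = \text{AMISE}(h) + o\left( \frac{1}{nh} + h^4 \right), \tag{10.16}$$

where

$$\text{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4 \sigma_K^4 R(f'')}{4} \tag{10.17}$$

is termed the asymptotic mean integrated squared error. If $nh \to \infty$ and $h \to 0$
as $n \to \infty$, then $\text{MISE}(h) \to 0$, confirming our intuition from the uniform kernel
estimator discussed in the chapter introduction. The error term in (10.16) can be
shown to equal $O(n^{-1} + h^5)$ with a more delicate analysis of the squared bias as in
[580], but it is the AMISE that interests us most.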

To minimize AMISE($h$) with respect to $h$, we must set $h$ at an intermediate value
that avoids excessive bias and excessive variability in $\hat f$. Minimizing AMISE($h$) with
respect to $h$ shows that exactly balancing the orders of the bias and variance terms in
(10.17) is best. The optimal bandwidth is

$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}, \tag{10.18}$$

but this result is not immediately helpful since it depends on the unknown density $f$.

Note that optimal bandwidths have $h = O(n^{-1/5})$, in which case
$\text{MISE} = O(n^{-4/5})$. This result reveals how quickly the bandwidth should shrink with
increasing sample size, but it says little about what bandwidth would be appropriate for
density estimation with a given dataset. A variety of automated bandwidth selection 
strategies have been proposed; see the following subsections. Their relative perfor- 
mance in real applications varies with the nature of f and the observed data. There 
is no universally best way to proceed. 

Many bandwidth selection methods rely on optimizing or finding the root of 
a function of h—for example, minimizing an approximation to AMISE(h). In these 
cases, a search may be conducted over a logarithmic-spaced grid of 50 or more 
values, linearly interpolating between grid points. When there are multiple roots or 
local minima, a grid search permits a better understanding of the bandwidth selection 
problem than would an automated optimization or root-finding algorithm. 
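
To make the grid-search strategy concrete, the sketch below minimizes AMISE(h) from (10.17) over a logarithmically spaced grid and compares the result with the closed form (10.18). It assumes, purely for illustration, that f is the standard normal density and that K is the normal kernel, so that R(K) = 1/(2√π), σ_K = 1, and R(f'') = 3/(8√π); the sample size and grid limits are arbitrary.

```python
import math

def amise(h, n, r_k, sigma_k, r_f2):
    """AMISE(h) from (10.17)."""
    return r_k / (n * h) + (h ** 4) * (sigma_k ** 4) * r_f2 / 4.0

n = 100
r_k = 1.0 / (2.0 * math.sqrt(math.pi))     # roughness of the normal kernel
sigma_k = 1.0                              # standard deviation of the normal kernel
r_f2 = 3.0 / (8.0 * math.sqrt(math.pi))    # R(f'') when f is standard normal (assumed)

# 200 logarithmically spaced candidate bandwidths between 0.01 and 10.
grid = [10 ** (-2.0 + 3.0 * k / 199) for k in range(200)]
h_grid = min(grid, key=lambda h: amise(h, n, r_k, sigma_k, r_f2))

# Closed-form minimizer (10.18).
h_closed = (r_k / (n * sigma_k ** 4 * r_f2)) ** 0.2
print(h_grid, h_closed)
```

The grid minimizer can only be as accurate as the grid spacing allows, which is why interpolation between grid points is suggested above.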


10.2.1.1 Cross-Validation Many bandwidth selection strategies begin by relating
$h$ to some measure of the quality of $\hat f$ as an estimator of $f$. The quality is quantified
by some $Q(h)$, whose estimate, $\hat Q(h)$, is optimized to find $h$.

If $\hat Q(h)$ evaluates the quality of $\hat f$ based on how well it fits the observed data in
some sense, then the observed data are being used twice: once to calculate $\hat f$ from the
data and a second time to evaluate the quality of $\hat f$ as an estimator of $f$. Such double
use of the data provides an overoptimistic view of the quality of the estimator. When
misled in this way, the chosen estimator tends to be overfitted (i.e., undersmoothed),
with too many wiggles or false modes.

Cross-validation provides a remedy to this problem. To evaluate the quality of
$\hat f$ at the $i$th data point, the model is fitted using all the data except the $i$th point. Let

$$\hat f_{-i}(X_i) = \frac{1}{h(n-1)} \sum_{j \neq i} K\left( \frac{X_i - X_j}{h} \right) \tag{10.19}$$

denote the estimated density at $X_i$ using a kernel density estimator with all the
observations except $X_i$. Choosing $\hat Q$ to be a function of the $\hat f_{-i}(X_i)$ separates the tasks of
fitting and evaluating $\hat f$ to select $h$.

Although cross-validation enjoys great success as a span selection strategy for
scatterplot smoothing (see Chapter 11), it is not always effective for bandwidth
selection in density estimation. The $h$ estimated by cross-validation approaches can
be highly sensitive to sampling variability. Despite the persistence of these methods
in general practice and in some software, a sophisticated plug-in method like
the Sheather-Jones approach (Section 10.2.1.2) is a much more reliable choice.
Nevertheless, cross-validation methods introduce some ideas that are useful in a
variety of contexts.

One easy cross-validation option is to let $\hat Q(h)$ be the pseudo-likelihood (PL)

$$\text{PL}(h) = \prod_{i=1}^n \hat f_{-i}(X_i), \tag{10.20}$$

as proposed in [171, 289]. The bandwidth is chosen to maximize the
pseudo-likelihood. Although simple and intuitively appealing, this approach frequently
produces kernel density estimates that are too wiggly and too sensitive to outliers [582].
The theoretical limiting performance of kernel density estimators with span chosen
to maximize PL($h$) is also poor: In many cases the estimator is not consistent [578].
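
A sketch of pseudo-likelihood bandwidth selection follows, working with log PL(h) for numerical stability and searching a logarithmically spaced grid. The simulated sample, the grid limits, the tiny floor inside the logarithm, and all function names are assumptions made for illustration.

```python
import math
import random

def normal_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def loo_density(i, data, h):
    """Leave-one-out estimate (10.19) evaluated at data[i]."""
    n = len(data)
    total = sum(normal_kernel((data[i] - data[j]) / h)
                for j in range(n) if j != i)
    return total / ((n - 1) * h)

def log_pseudo_likelihood(data, h):
    """log PL(h) from (10.20); the floor guards against log(0) from underflow."""
    return sum(math.log(max(loo_density(i, data, h), 1e-300))
               for i in range(len(data)))

rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]

# 50 logarithmically spaced bandwidths from 0.1 to about 3.16.
grid = [10 ** (-1.0 + 1.5 * k / 49) for k in range(50)]
h_pl = max(grid, key=lambda h: log_pseudo_likelihood(data, h))
print(h_pl)
```

Because each leave-one-out term is driven by the nearest neighbors of X_i, an isolated observation can pull the selected bandwidth noticeably, which is one way to see the sensitivity to outliers noted above.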

Another approach is motivated by reexpressing the integrated squared error as

$$\begin{aligned}
\text{ISE}(h) &= \int \hat f^2(x)\, dx - 2E\{\hat f(X)\} + \int f(x)^2\, dx \\
&= R(\hat f) - 2E\{\hat f(X)\} + R(f).
\end{aligned} \tag{10.21}$$

The final term in this expression is constant, and the middle term can be estimated by
$-(2/n) \sum_{i=1}^n \hat f_{-i}(X_i)$. Thus, minimizing

$$\text{UCV}(h) = R(\hat f) - \frac{2}{n} \sum_{i=1}^n \hat f_{-i}(X_i) \tag{10.22}$$

with respect to $h$ should provide a good bandwidth [56, 561]. UCV($h$) is called
the unbiased cross-validation criterion because $E\{\text{UCV}(h) + R(f)\} = \text{MISE}(h)$. The
approach is also called least squares cross-validation because choosing $h$ to minimize
UCV($h$) minimizes an estimate of the integrated squared error between $\hat f$ and $f$.

If analytic evaluation of $R(\hat f)$ is not possible, the best way to evaluate (10.22) is
probably to use a different kernel that permits an analytic simplification. For a normal
kernel $\phi$ it can be shown that

$$\text{UCV}(h) = \frac{R(\phi)}{nh} + \frac{1}{n(n-1)h} \sum_{i=1}^n \sum_{j \neq i} \left[ \frac{1}{(8\pi)^{1/4}}\, \phi^{1/2}\left( \frac{X_i - X_j}{h} \right) - 2\phi\left( \frac{X_i - X_j}{h} \right) \right], \tag{10.23}$$

following the steps outlined in Problem 10.3. This expression can be computed
efficiently without numerical approximation.
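
The sketch below evaluates the criterion (10.22) directly for a normal kernel. It relies on the fact that the convolution of two standard normal densities is the N(0, 2) density, which gives R(f̂) in closed form; this route to the computation, the simulated data, and the function names are assumptions for illustration rather than a transcription of (10.23).

```python
import math
import random

SQRT_PI = math.sqrt(math.pi)

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def roughness_fhat(data, h):
    """R(f-hat) for a normal kernel, using the N(0, 2) density exp(-d^2/4)/(2 sqrt(pi))."""
    n = len(data)
    total = 0.0
    for xi in data:
        for xj in data:
            d = (xi - xj) / h
            total += math.exp(-0.25 * d * d) / (2.0 * SQRT_PI)
    return total / (n * n * h)

def ucv(data, h):
    """UCV(h) in (10.22): R(f-hat) minus (2/n) times the leave-one-out sum."""
    n = len(data)
    loo_sum = 0.0
    for i in range(n):
        s = sum(phi((data[i] - data[j]) / h) for j in range(n) if j != i)
        loo_sum += s / ((n - 1) * h)
    return roughness_fhat(data, h) - 2.0 * loo_sum / n

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(60)]
grid = [10 ** (-1.0 + 1.5 * k / 39) for k in range(40)]   # 0.1 to about 3.16
h_ucv = min(grid, key=lambda h: ucv(data, h))
print(h_ucv)
```

Plotting `ucv(data, h)` against the grid, rather than trusting the single minimizer, is a reasonable precaution given how erratic this criterion can be.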

Although the bandwidth identified by minimizing UCV(h) with respect to h is 
asymptotically as good as the best possible bandwidth [293, 614], its convergence 
to the optimum is extremely slow [298, 583]. In practical settings, using unbiased 
cross-validation is risky because the resulting bandwidth tends to exhibit a strong 
dependence on the observed data. In other words, when applied to different datasets 
drawn from the same distribution, unbiased cross-validation can yield very different 
answers. Its performance is erratic in application, and undersmoothing is frequently
seen.


FIGURE 10.3 Histogram of the times of 121 bowhead whale calf sightings during the 2001
spring migration discussed in Example 10.2. The date of each sighting is expressed as the
number of hours since midnight, April 5, when the first adult whale was sighted.


The high sampling variability of unbiased cross-validation is mainly attributable
to the fact that the target performance criterion, $Q(h) = \text{ISE}(h)$, is itself random,
unlike MISE($h$). Scott and Terrell have proposed a biased cross-validation criterion
[BCV($h$)] that seeks to minimize an estimate of AMISE($h$) [583]. In practice, this
approach is generally outperformed by the best plug-in methods (Section 10.2.1.2)
and can yield excessively large bandwidths and oversmooth density estimates.


Example 10.2 (Whale Migration) Figure 10.3 shows a histogram of times of 
121 bowhead whale calf sightings during the spring 2001 visual census conducted 
at the ice edge near Point Barrow, Alaska. This census is the central component of 
international efforts to manage this endangered whale population while allowing a 
small traditional subsistence hunt by coastal Inupiat communities [156, 249, 528]. 

The timing of the northeasterly spring migration is surprisingly regular, and 
it is important to characterize the migration pattern for planning of future scientific 
efforts to study these animals. There is speculation that the migration may occur in 
loosely defined pulses. If so, this is important to discover because it may lead to new 
insights about bowhead whale biology and stock structure. 

Figure 10.4 shows the results of kernel density estimates for these data using 
the normal kernel. Three different cross-validation criteria were used to select h. 
Maximizing cross-validated PL(h) with respect to h yields h = 9.75 and the density 
estimate shown with the dashed curve. This density estimate is barely adequate, 
exhibiting likely false modes in several regions. The result from minimizing UCV(h) 
with respect to h is even worse in this application, giving h = 5.08 and the density 
estimate shown with the dotted curve. This bandwidth is clearly too small. Finally, 
minimizing BCV(h) with respect to h yields h = 26.52 and the density estimate 
shown with the solid line. Clearly the best of the three options, this density estimate
emphasizes only the most prominent features of the data distribution but may seem
oversmooth. Perhaps a bandwidth between 10 and 26 would be preferable.


FIGURE 10.4 Kernel density estimates for the whale calf migration data in Example 10.2
using normal kernels and bandwidths chosen by three different cross-validation criteria. The
bandwidths are 9.75 using PL($h$) (dashed), 5.08 using UCV($h$) (dotted), and 26.52 using
BCV($h$) (solid).


10.2.1.2 Plug-in Methods Plug-in methods apply a pilot bandwidth to estimate 
one or more important features of f. The bandwidth for estimating f itself is then 
estimated at a second stage using a criterion that depends on the estimated features. 
The best plug-in methods have proven to be very effective in diverse applications and 
are more popular than cross-validation approaches. However, Loader offers arguments 
against the uncritical rejection of cross-validation approaches [422]. 

For unidimensional kernel density estimation, recall that the bandwidth that
minimizes AMISE is given by

$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}, \tag{10.24}$$

where $\sigma_K^2$ is the variance of $K$, viewing $K$ as a density. At first glance, (10.24) seems
unhelpful because the optimal bandwidth depends on the unknown density $f$ through
the roughness of its second derivative. A variety of methods have been proposed to
estimate $R(f'')$.

Silverman suggests an elementary approach: replacing $f$ by a normal density
with variance set to match the sample variance [598]. This amounts to estimating
$R(f'')$ by $R(\phi'')/\hat\sigma^5$, where $\phi$ is the standard normal density function. Silverman's
rule of thumb therefore gives

$$h = \left( \frac{4}{3n} \right)^{1/5} \hat\sigma. \tag{10.25}$$

If $f$ is multimodal, the ratio of $R(f'')$ to $\hat\sigma$ may be larger than it would be for
normally distributed data. This would result in oversmoothing. A better bandwidth can
be obtained by considering the interquartile range (IQR), which is a more robust
measure of spread than is $\hat\sigma$. Thus, Silverman suggests replacing $\hat\sigma$ in (10.25) by
$\tilde\sigma = \min\{\hat\sigma, \text{IQR}/(\Phi^{-1}(0.75) - \Phi^{-1}(0.25))\} \approx \min\{\hat\sigma, \text{IQR}/1.35\}$, where $\Phi$ is the
standard normal cumulative distribution function. Although simple, this approach
cannot be recommended for general use because it has a strong tendency to
oversmooth. Silverman's rule of thumb is valuable, however, as a method for producing
approximate bandwidths effective for pilot estimation of quantities used in
sophisticated plug-in methods.
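
Silverman's rule of thumb (10.25), with the optional robust scale estimate, can be sketched as below for use with a normal kernel. The function name, the `robust` flag, and the made-up data are hypothetical; the quartiles come from `statistics.quantiles` with its default method, which is only one of several quantile conventions.

```python
import statistics

def silverman_bandwidth(data, robust=True):
    """Rule of thumb (10.25); with robust=True, sigma-hat is replaced by
    min(sigma-hat, IQR / 1.35)."""
    n = len(data)
    scale = statistics.stdev(data)
    if robust:
        q1, _, q3 = statistics.quantiles(data, n=4)
        scale = min(scale, (q3 - q1) / 1.35)
    return (4.0 / (3.0 * n)) ** 0.2 * scale

# Made-up sample with one outlying value that inflates the standard deviation.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
h_plain = silverman_bandwidth(data, robust=False)
h_robust = silverman_bandwidth(data, robust=True)
print(h_plain, h_robust)
```

The robust version can never exceed the plain one, so it offers some protection against the oversmoothing caused by heavy tails or outliers.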

Empirical estimation of $R(f'')$ in (10.24) is a better option than Silverman's
rule of thumb. The kernel-based estimator is

$$\hat f''(x) = \frac{d^2}{dx^2} \left\{ \frac{1}{n h_0} \sum_{i=1}^n L\left( \frac{x - X_i}{h_0} \right) \right\} = \frac{1}{n h_0^3} \sum_{i=1}^n L''\left( \frac{x - X_i}{h_0} \right), \tag{10.26}$$

where $h_0$ is the bandwidth and $L$ is a sufficiently differentiable kernel used to estimate
$f''$. Estimation of $R(f'')$ follows from (10.26).

It is important to recognize, however, that the best bandwidth for estimating $f$
will differ from the best bandwidth for estimating $f''$ or $R(f'')$. This is because
$\text{var}\{\hat f''\}$ contributes a proportionally greater share to the mean squared error for
estimating $f''$ than $\text{var}\{\hat f\}$ does for estimating $f$. Therefore, a larger bandwidth is
required for estimating $f''$. We therefore anticipate $h_0 > h$.

Suppose we use bandwidth $h_0$ with kernel $L$ to estimate $R(f'')$, and bandwidth
$h$ with kernel $K$ to estimate $f$. Then the asymptotic mean squared error for estimation
of $R(f'')$ using kernel $L$ is minimized when $h_0 \propto n^{-1/7}$. To determine how $h_0$ should
be related to $h$, recall that optimal bandwidths for estimating $f$ have $h \propto n^{-1/5}$.
Solving this expression for $n$ and replacing $n$ in the equation $h_0 \propto n^{-1/7}$, one can
show that

$$h_0 = C_1(R(f''), R(f'''))\, C_2(L)\, h^{5/7}, \tag{10.27}$$

where $C_1$ and $C_2$ are functionals that depend on derivatives of $f$ and on the kernel $L$,
respectively. Equation (10.27) still depends on the unknown $f$, but the quality of the
estimate of $R(f'')$ produced using $h_0$ and $L$ is not excessively deteriorated if $h_0$ is set
using relatively simple estimates to find $C_1$ and $C_2$. In fact, we may estimate $C_1$ and
$C_2$ using a bandwidth chosen by Silverman's rule of thumb.

The result is a two-stage process for finding the bandwidth, known as the
Sheather-Jones method [359, 593]. At the first stage, a simple rule of thumb is used
to calculate the bandwidth $h_0$. This bandwidth is used to estimate $R(f'')$, which is the
only unknown in expression (10.24) for the optimal bandwidth. Then the bandwidth
$h$ is computed via (10.24) and is used to produce the final kernel density estimate.

For univariate kernel density estimation with pilot kernel $L = \phi$, the
Sheather-Jones bandwidth is the value of $h$ that solves the equation

$$\left( \frac{R(K)}{n \sigma_K^4 \hat R_{\hat\alpha(h)}(f'')} \right)^{1/5} - h = 0, \tag{10.28}$$

where

$$\hat R_{\hat\alpha(h)}(f'') = \frac{1}{n(n-1)\hat\alpha^5} \sum_{i=1}^n \sum_{j=1}^n \phi^{(4)}\left( \frac{X_i - X_j}{\hat\alpha} \right),$$

$$\hat\alpha = \hat\alpha(h) = \left( \frac{6\sqrt{2}\, h^5\, \hat R_a(f'')}{\hat R_b(f''')} \right)^{1/7},$$

$$\hat R_a(f'') = \frac{1}{n(n-1)a^5} \sum_{i=1}^n \sum_{j=1}^n \phi^{(4)}\left( \frac{X_i - X_j}{a} \right),$$

$$\hat R_b(f''') = -\frac{1}{n(n-1)b^7} \sum_{i=1}^n \sum_{j=1}^n \phi^{(6)}\left( \frac{X_i - X_j}{b} \right),$$

$$a = \frac{0.920\,(\text{IQR})}{n^{1/7}}, \qquad b = \frac{0.912\,(\text{IQR})}{n^{1/9}},$$

where $\phi^{(i)}$ is the $i$th derivative of the normal density function, and IQR is the
interquartile range of the data. The solution to (10.28) can be found using grid search
or a root-finding technique from Chapter 2, such as Newton's method.

The Sheather—Jones method generally performs extremely well [359, 360, 501, 
592]. There are a variety of other good methods based on carefully chosen approxi- 
mations to MISE(h) or its minimizer [88, 300, 301, 358, 500]. In each case, careful 
pilot estimation of various quantities plays a critical role in ensuring that the final 
bandwidth performs well. Some of these approaches give bandwidths that asymptot- 
ically converge more quickly to the optimal bandwidth than does the Sheather—Jones 
method; all can be useful options in some circumstances. However, none of these offer 
substantially easier practical implementation or broadly better performance than the 
Sheather—Jones approach. 
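
A compact sketch of solving (10.28) for a normal kernel K = φ (so that R(K) = 1/(2√π) and σ_K = 1) is given below. It follows the quantities defined with (10.28), but several ingredients are assumptions made for illustration: bisection replaces Newton's method, the bracket for the root is a guess scaled by the sample standard deviation, the quartiles use the default method of `statistics.quantiles`, and the data are simulated. If (10.28) has several roots, bisection will locate only one of them.

```python
import math
import random
import statistics

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi4(z):   # fourth derivative of the standard normal density
    return (z ** 4 - 6.0 * z ** 2 + 3.0) * phi(z)

def phi6(z):   # sixth derivative of the standard normal density
    return (z ** 6 - 15.0 * z ** 4 + 45.0 * z ** 2 - 15.0) * phi(z)

def double_sum(deriv, data, scale):
    """Sum of deriv((X_i - X_j) / scale) over all pairs i, j (diagonal included)."""
    n = len(data)
    total = n * deriv(0.0)
    for i in range(n):
        for j in range(i + 1, n):
            total += 2.0 * deriv((data[i] - data[j]) / scale)   # deriv is even
    return total

def sheather_jones(data, tol=1e-8):
    n = len(data)
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    a = 0.920 * iqr / n ** (1.0 / 7.0)
    b = 0.912 * iqr / n ** (1.0 / 9.0)
    r_a = double_sum(phi4, data, a) / (n * (n - 1) * a ** 5)
    r_b = -double_sum(phi6, data, b) / (n * (n - 1) * b ** 7)
    r_k = 1.0 / (2.0 * math.sqrt(math.pi))      # R(K) for K = phi; sigma_K = 1

    def g(h):   # left-hand side of (10.28)
        alpha = (6.0 * math.sqrt(2.0) * h ** 5 * r_a / r_b) ** (1.0 / 7.0)
        r_alpha = double_sum(phi4, data, alpha) / (n * (n - 1) * alpha ** 5)
        return (r_k / (n * r_alpha)) ** 0.2 - h

    sd = statistics.stdev(data)
    lo, hi = 0.01 * sd, 5.0 * sd                # assumed bracket for the root
    if g(lo) <= 0.0 or g(hi) >= 0.0:
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:                        # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(5)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
h_sj = sheather_jones(data)
print(h_sj)
```

Each evaluation of the left-hand side of (10.28) costs on the order of n² kernel evaluations, so for large samples a binned approximation to the double sums would be a natural refinement.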


Example 10.3 (Whale Migration, Continued) Figure 10.5 illustrates the use of 
Silverman’s rule of thumb and the Sheather-Jones method on the bowhead whale 
migration data introduced in Example 10.2. The bandwidth given by the Sheather— 
Jones approach is 10.22, yielding the density estimate shown with the solid line. This 
bandwidth seems a bit too narrow, yielding a density estimate that is too wiggly. 
Silverman’s rule of thumb gives a bandwidth of 32.96, larger than the bandwidth 
given by any previous method. The resulting density estimate is probably too smooth, 
hiding important features of the distribution.


FIGURE 10.5 Kernel density estimates for the whale calf migration data using normal
kernels with bandwidths chosen by three different criteria. The bandwidths are 10.22 using the
Sheather-Jones approach (solid), 32.96 using Silverman's rule of thumb (dashed), and 35.60
using the maximal smoothing span of Terrell (dotted).


10.2.1.3 Maximal Smoothing Principle Recall again that the minimal AMISE
is obtained when

$$h = \left( \frac{R(K)}{n \sigma_K^4 R(f'')} \right)^{1/5}, \tag{10.29}$$

but $f$ is unknown. Silverman's rule of thumb replaces $R(f'')$ by the value
$R(\phi'')/\hat\sigma^5$ implied by a normal density. The Sheather-Jones method estimates $R(f'')$.
Terrell's maximal smoothing approach replaces $R(f'')$ with the most conservative
(i.e., smallest) possible value [627].

Specifically, Terrell considered the collection of all $h$ given by (10.29) for
various $f$ and recommended that the largest such bandwidth be chosen.
In other words, the right-hand side of (10.29) should be maximized with respect to
$f$. This will bias bandwidth selection against undersmoothing. Since $R(f'')$ vanishes
as the variance of $f$ grows, the maximization is carried out subject to the constraint
that the variance of $f$ matches the sample variance $\hat\sigma^2$.

Constrained maximization of (10.29) with respect to $f$ is an exercise in the
calculus of variations. The $f$ that maximizes (10.29) is a polynomial. Substituting its
roughness for $R(f'')$ in (10.29) yields

$$h = 3 \left( \frac{R(K)}{35n} \right)^{1/5} \hat\sigma \tag{10.30}$$

as the chosen bandwidth. Table 10.1 provides the values of $R(K)$ for some common
kernels.

TABLE 10.1 Some kernel choices and related quantities discussed in the text. The kernels are listed
in increasing order of roughness, $R(K)$. $K(z)$ should be multiplied by $1_{\{|z| < 1\}}$ in all cases except the
normal kernel, which has positive support over the entire real line. RE is the asymptotic relative
efficiency, as described in Section 10.2.2.1.

Name            K(z)                              R(K)               δ(K)                        RE
Normal          $\exp\{-z^2/2\}/\sqrt{2\pi}$      $1/(2\sqrt{\pi})$  $(1/(2\sqrt{\pi}))^{1/5}$   1.051
Uniform         $1/2$                             $1/2$              $(9/2)^{1/5}$               1.076
Epanechnikov    $(3/4)(1 - z^2)$                  $3/5$              $15^{1/5}$                  1.000
Triangle        $1 - |z|$                         $2/3$              $24^{1/5}$                  1.014
Biweight        $(15/16)(1 - z^2)^2$              $5/7$              $35^{1/5}$                  1.006
Triweight       $(35/32)(1 - z^2)^3$              $350/429$          $(9450/143)^{1/5}$          1.013


Terrell proposed the maximal smoothing principle to motivate this choice of
bandwidth. When interpreting a density estimate, the analyst's eye is naturally drawn
to modes. Further, modes usually have important scientific implications. Therefore,
the bandwidth should be selected to discourage false modes, producing an estimate
that shows modes only where the data indisputably require them.

The maximal smoothing approach is appealing because it is quick and simple
to calculate. In practice, the resulting kernel density estimate is often too smooth. We
would be reluctant to use a maximal smoothing bandwidth when the density estimate is
being used for inference. For exploratory analyses, the maximal smoothing bandwidth
can be quite helpful, allowing the analyst to focus on the dominant features of the
density without being misled by highly variable indications of possibly false modes.
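
For a normal kernel, the maximal smoothing bandwidth (10.30) and Silverman's rule (10.25) differ only by a constant factor, which the sketch below computes. The function names and the made-up data are hypothetical, and the code assumes the normal kernel throughout, using R(K) = 1/(2√π) from Table 10.1.

```python
import math
import statistics

R_NORMAL = 1.0 / (2.0 * math.sqrt(math.pi))   # R(K) for the normal kernel

def maximal_smoothing_bandwidth(data, roughness=R_NORMAL):
    """Bandwidth (10.30): h = 3 * (R(K) / (35 n))^(1/5) * sigma-hat."""
    n = len(data)
    return 3.0 * (roughness / (35.0 * n)) ** 0.2 * statistics.stdev(data)

def silverman_bandwidth(data):
    """Rule of thumb (10.25), without the IQR adjustment."""
    return (4.0 / (3.0 * len(data))) ** 0.2 * statistics.stdev(data)

data = [2.1, 3.4, 3.9, 4.4, 5.0, 5.2, 6.8, 7.7, 9.3, 12.5]   # made-up values
ratio = maximal_smoothing_bandwidth(data) / silverman_bandwidth(data)
print(round(ratio, 3))   # about 1.08, whatever the sample
```

The ratio of roughly 1.08 agrees with the whale migration bandwidths reported in this section, 35.60 for maximal smoothing against 32.96 for Silverman's rule.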


Example 10.4 (Whale Migration, Continued) The dotted line in Figure 10.5 
shows the density estimate obtained using the maximal smoothing bandwidth of 
35.60. Even larger than Silverman’s bandwidth, this choice appears too large for 
the whale data. Generally, both Silverman’s rule of thumb and Terrell’s maximal 
smoothing principle tend to produce oversmoothed density estimates. 


10.2.2 Choice of Kernel 


Kernel density estimation requires specification of two components: the kernel and 
the bandwidth. It turns out that the shape of the kernel has much less influence on the 
results than does the bandwidth. Table 10.1 lists a few choices for kernel functions. 


10.2.2.1 Epanechnikov Kernel Suppose K is limited to bounded, symmetric 
densities with finite moments and variance equal to 1. Then Epanechnikov showed that 
minimizing AMISE with respect to K amounts to minimizing R(K) with respect to 
K subject to these constraints [186]. The solution to this variational calculus problem 
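is the kernel assigning density $\frac{1}{\sqrt{5}} K^*(z/\sqrt{5})$, where $K^*$ is the Epanechnikov kernel 

\[
K^*(z) = \begin{cases} \frac{3}{4}(1 - z^2) & \text{if } |z| < 1, \\ 0 & \text{otherwise.} \end{cases} \tag{10.31}
\]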


This is a symmetric quadratic function, centered at zero, where its mode is reached, 
and decreasing to zero at the limits of its support. 

From (10.17) and (10.18) we see that the minimal AMISE for a kernel density 
estimator with positive kernel K is $\frac{5}{4}[\sigma_K R(K)/n]^{4/5} R(f'')^{1/5}$. Switching to a K 
that doubles $\sigma_K R(K)$ therefore requires doubling n to maintain the same minimal 
AMISE. Thus, $\sigma_{K_2} R(K_2)/(\sigma_{K_1} R(K_1))$ measures the asymptotic relative efficiency of 
$K_2$ compared to $K_1$. The relative efficiencies of a variety of kernels compared to the 
Epanechnikov kernel are given in Table 10.1. Notice that the relative efficiencies are 
all quite close to 1, reinforcing the point that kernel choice is fairly unimportant. 
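The quantities in Table 10.1 can be checked numerically. The following is a minimal sketch, assuming SciPy is available; the names `KERNELS`, `kernel_summary`, and `relative_efficiency` are ours and purely illustrative. It integrates each kernel to obtain R(K) and $\sigma_K$, then forms $\delta(K) = (R(K)/\sigma_K^4)^{1/5}$ and the efficiency ratio relative to the Epanechnikov kernel.

```python
import numpy as np
from scipy.integrate import quad

# Kernel shapes from Table 10.1, stored as (function, half-width of support).
KERNELS = {
    "normal":       (lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi), np.inf),
    "uniform":      (lambda z: 0.5, 1.0),
    "epanechnikov": (lambda z: 0.75 * (1 - z**2), 1.0),
    "triangle":     (lambda z: 1 - abs(z), 1.0),
    "biweight":     (lambda z: 15 / 16 * (1 - z**2) ** 2, 1.0),
    "triweight":    (lambda z: 35 / 32 * (1 - z**2) ** 3, 1.0),
}


def kernel_summary(name):
    """Return (R(K), sigma_K, delta(K)) by numerical integration."""
    K, a = KERNELS[name]
    # Every kernel here is symmetric, so integrate over [0, a] and double.
    roughness = 2 * quad(lambda z: K(z) ** 2, 0, a)[0]
    variance = 2 * quad(lambda z: z**2 * K(z), 0, a)[0]
    sigma = np.sqrt(variance)
    delta = (roughness / sigma**4) ** 0.2
    return roughness, sigma, delta


def relative_efficiency(name, baseline="epanechnikov"):
    """sigma_K R(K) for the named kernel divided by the same product for the baseline."""
    r, s, _ = kernel_summary(name)
    r0, s0, _ = kernel_summary(baseline)
    return (s * r) / (s0 * r0)
```

Calling `relative_efficiency(name)` for each key of `KERNELS` gives values that can be compared with the RE column of the table.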


10.2.2.2 Canonical Kernels and Rescalings Unfortunately, a particular value 
of h corresponds to a different amount of smoothing depending on which kernel is 
being used. For example, h = 1 corresponds to a kernel standard deviation three times 
larger (a variance nine times larger) for the normal kernel than for the triweight kernel. 

Let $h_K$ and $h_L$ denote the bandwidths that minimize AMISE(h) when using 
symmetric kernel densities K and L, respectively, which have mean zero and finite 
positive variance. Then from (10.29) it is clear that 

\[
\frac{h_K}{h_L} = \frac{\delta(K)}{\delta(L)}, \tag{10.32}
\]

where for any kernel we have $\delta(K) = (R(K)/\sigma_K^4)^{1/5}$. Thus, to change from bandwidth 
h for kernel K to a bandwidth that gives an equivalent amount of smoothing for kernel 
L, use the bandwidth $h\,\delta(L)/\delta(K)$. Table 10.1 lists values for $\delta(K)$ for some common 
kernels. 
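The rescaling rule amounts to one multiplication. A small sketch, with the $\delta(K)$ values transcribed from Table 10.1 and a hypothetical helper name `convert_bandwidth`:

```python
import math

# delta(K) = (R(K) / sigma_K^4)^(1/5), transcribed from Table 10.1.
DELTA = {
    "normal": (1 / (2 * math.sqrt(math.pi))) ** 0.2,
    "uniform": (9 / 2) ** 0.2,
    "epanechnikov": 15 ** 0.2,
    "triangle": 24 ** 0.2,
    "biweight": 35 ** 0.2,
    "triweight": (9450 / 143) ** 0.2,
}


def convert_bandwidth(h, from_kernel, to_kernel):
    """Bandwidth for to_kernel giving the same smoothing as h does for from_kernel."""
    return h * DELTA[to_kernel] / DELTA[from_kernel]
```

For instance, `convert_bandwidth(h, "normal", "epanechnikov")` multiplies a normal-kernel bandwidth by roughly 2.2, reflecting the much smaller variance of the Epanechnikov shape as listed in the table.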

Suppose further that we rescale each kernel shape in Table 10.1 so that h = 
1 corresponds to a bandwidth of $\delta(K)$. The kernel density estimator can then be 
written as 

\[
\hat{f}_X(x) = \frac{1}{n} \sum_{i=1}^{n} K_{h\delta(K)}(x - X_i),
\]

where 

\[
K_{h\delta(K)}(z) = \frac{1}{h\,\delta(K)}\, K\!\left(\frac{z}{h\,\delta(K)}\right),
\]

and K represents one of the original kernel shapes and scalings shown in Table 10.1. 
Scaling kernels in this way provides a canonical kernel $K_{\delta(K)}$ of each shape [440]. A 
key benefit of this viewpoint is that a single value of h can be used interchangeably 
for each canonical kernel without affecting the amount of smoothing in the density 
estimate. 

FIGURE 10.6 Kernel density estimates for the data in Example 10.1 using the canonical 
version of each of the six kernels from Table 10.1, with h = 0.69 (dotted). 

Note that 

\[
\mathrm{AMISE}(h) = C(K_{\delta(K)}) \left( \frac{1}{nh} + \frac{h^4 R(f'')}{4} \right) \tag{10.33}
\]

for an estimator using a canonical kernel with bandwidth h [i.e., a kernel from 
Table 10.1 with bandwidth $h\,\delta(K)$] and with $C(K_{\delta(K)}) = (\sigma_K R(K))^{4/5}$. This means 
that the balance between variance and squared bias determined by the factor 
$(nh)^{-1} + h^4 R(f'')/4$ is no longer confounded with the chosen kernel. It also means 
that the contributions made by the kernel to the variance and squared bias terms in 
AMISE(h) are equal. It follows that the optimal kernel shape does not depend on 
the bandwidth: The Epanechnikov kernel shape is optimal for any desired degree of 
smoothing [440]. 
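The form of (10.33) can be verified directly. The sketch below assumes that the AMISE of a kernel estimator with bandwidth b is $R(K)/(nb) + b^4 \sigma_K^4 R(f'')/4$, the standard expression that the earlier AMISE equations appear to rely on, and substitutes $b = h\,\delta(K)$ with $\delta(K) = (R(K)/\sigma_K^4)^{1/5}$.

```latex
\begin{align*}
\mathrm{AMISE}\bigl(h\,\delta(K)\bigr)
  &= \frac{R(K)}{n h\,\delta(K)} + \frac{h^4\,\delta(K)^4\,\sigma_K^4\,R(f'')}{4}, \\
\frac{R(K)}{\delta(K)}
  &= R(K)^{4/5}\,\sigma_K^{4/5} = \bigl(\sigma_K R(K)\bigr)^{4/5}, \\
\delta(K)^4\,\sigma_K^4
  &= R(K)^{4/5}\,\sigma_K^{-16/5}\,\sigma_K^{4} = \bigl(\sigma_K R(K)\bigr)^{4/5}, \\
\mathrm{AMISE}\bigl(h\,\delta(K)\bigr)
  &= \bigl(\sigma_K R(K)\bigr)^{4/5} \left( \frac{1}{nh} + \frac{h^4 R(f'')}{4} \right).
\end{align*}
```

Both terms pick up the same kernel-dependent multiplier, which is why the kernel contributes equally to the variance and squared bias terms.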


Example 10.5 (Bimodal Density, Continued) Figure 10.6 shows kernel density 
estimates for the data from Example 10.1, which originated from the equally weighted 
mixture of $N(4, 1^2)$ and $N(9, 2^2)$ densities. All the bandwidths were set at 0.69 for 
the canonical kernels of each shape, that being the Sheather-Jones bandwidth for the 
normal kernel. The uniform kernel produces a noticeably rougher result due to its 
discontinuity. The Epanechnikov and uniform kernels provide a slight (false) sug- 
gestion that the lower mode contains two small local modes. Aside from these small 
differences, the results for all the kernels are qualitatively the same. This example 
illustrates that even quite different kernels can be scaled to produce such similar 
results that the choice of kernel is unimportant. 


10.3 NONKERNEL METHODS 


10.3.1 Logspline 


A cubic spline is a piecewise cubic function that is everywhere twice continuously 
differentiable but whose third derivative may be discontinuous at a finite number of 
prespecified knots. One may view a cubic spline as a function created from cubic 
polynomials on each between-knot interval by pasting them together so that the result 
is twice continuously differentiable at the knots. Kooperberg and Stone’s logspline density estimation 
approach estimates the log of f by a cubic spline of a certain form [389, 615]. 

This method provides univariate density estimation on an interval (L, U), where 
each endpoint may be infinite. Suppose there are $M \ge 3$ knots, $t_j$ for $j = 1, \ldots, M$, 
with $L < t_1 < t_2 < \cdots < t_M < U$. Knot selection will be discussed later. 

Let S denote the M-dimensional space consisting of cubic splines with knots at 
$t_1, \ldots, t_M$ and that are linear on $(L, t_1]$ and $[t_M, U)$. Let a basis for S be denoted by 
the functions $\{1, B_1, \ldots, B_{M-1}\}$. There are numerical advantages to certain types 
of bases; books on splines and the other references in this section provide additional 
detail [144, 577]. It is possible to choose the basis functions so that on $(L, t_1]$ the 
function $B_1$ is linear with a negative slope and all other $B_i$ are constant, and so that on 
$[t_M, U)$ the function $B_{M-1}$ is linear with a positive slope and all other $B_i$ are constant. 

Now consider modeling f with a parameterized density, $f_{X|\theta}$, defined by 

\[
\log f_{X|\theta}(x|\theta) = \theta_1 B_1(x) + \cdots + \theta_{M-1} B_{M-1}(x) - c(\theta), \tag{10.34}
\]

where 

\[
\exp\{c(\theta)\} = \int_L^U \exp\{\theta_1 B_1(x) + \cdots + \theta_{M-1} B_{M-1}(x)\}\, dx \tag{10.35}
\]

and $\theta = (\theta_1, \ldots, \theta_{M-1})$. For this to be a reasonable model for a density, we 
require $c(\theta)$ to be finite, which is ensured if (i) $L > -\infty$ or $\theta_1 < 0$ and (ii) $U < \infty$ 
or $\theta_{M-1} < 0$. Under this model, the log likelihood of $\theta$ is 

\[
l(\theta | x_1, \ldots, x_n) = \sum_{i=1}^{n} \log f_{X|\theta}(x_i|\theta), \tag{10.36}
\]

given observed data values $x_1, \ldots, x_n$. As long as the knots are positioned so that each 
interval contains sufficiently many observations for estimation, maximizing (10.36) 
subject to the constraint that $c(\theta)$ is finite provides the maximum likelihood estimate, 
$\hat{\theta}$. This estimate is unique because $l(\theta | x_1, \ldots, x_n)$ is concave. Having estimated the 
model parameters, take 

\[
\hat{f}(x) = f_{X|\theta}(x|\hat{\theta}) \tag{10.37}
\]

as the maximum likelihood logspline density estimate of f(x). 
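To make the model in (10.34) through (10.37) concrete, the following is a rough sketch of maximum likelihood fitting for fixed knots on a finite interval, so that $c(\theta)$ is automatically finite. It is not the Kooperberg and Stone implementation. Several choices are our own assumptions: the basis is the truncated-power natural cubic spline basis (a different basis for the same space S, without the sign conventions described above), the integral in (10.35) is approximated by the trapezoid rule on a grid, and the optimizer is SciPy's BFGS. The names `natural_basis` and `fit_logspline` are hypothetical, and the data are assumed to be on a modest scale so that the cubic terms stay well conditioned.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def natural_basis(x, knots):
    """Columns B_1..B_{M-1}: x itself plus M-2 natural cubic spline terms.

    Every column is linear outside [knots[0], knots[-1]], as S requires.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)

    def d(k):
        num = np.clip(x - t[k], 0, None) ** 3 - np.clip(x - t[-1], 0, None) ** 3
        return num / (t[-1] - t[k])

    last = d(len(t) - 2)
    return np.column_stack([x] + [d(k) - last for k in range(len(t) - 2)])


def fit_logspline(data, knots, lower, upper, grid_size=2001):
    """Maximize the log likelihood (10.36) for fixed knots on (lower, upper)."""
    grid = np.linspace(lower, upper, grid_size)
    w = np.full(grid_size, grid[1] - grid[0])   # trapezoid weights for (10.35)
    w[[0, -1]] *= 0.5
    B_grid = natural_basis(grid, knots)
    B_mean = natural_basis(data, knots).mean(axis=0)

    def neg_mean_loglik(theta):
        log_terms = B_grid @ theta
        c = logsumexp(log_terms, b=w)            # quadrature version of c(theta)
        probs = np.exp(log_terms - c) * w        # fitted probabilities on the grid
        grad = B_grid.T @ probs - B_mean         # E_theta[B] minus sample mean of B
        return c - B_mean @ theta, grad

    start = np.zeros(B_grid.shape[1])            # theta = 0 is the uniform density
    theta = minimize(neg_mean_loglik, start, jac=True, method="BFGS").x
    c = logsumexp(B_grid @ theta, b=w)

    def density(x):
        return np.exp(natural_basis(x, knots) @ theta - c)

    return theta, density


# Illustrative data: a two-component normal mixture on a modest scale.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.3, 0.05, 200), rng.normal(0.7, 0.10, 200)])
lower, upper = data.min() - 0.1, data.max() + 0.1
knots = np.quantile(data, [0.0, 0.25, 0.5, 0.75, 1.0])
theta, density = fit_logspline(data, knots, lower, upper)
```

Because the objective is convex in $\theta$, a general-purpose quasi-Newton method is adequate for a sketch like this; production software would use a better-conditioned basis and handle infinite endpoints through the sign constraints on $\theta_1$ and $\theta_{M-1}$.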

The maximum likelihood estimation of $\theta$ is conditional on the number of knots 
and their placement. Kooperberg and Stone suggest an automated strategy for place- 
ment of a prespecified number of knots [390]. Their strategy places knots at the small- 
est and largest observed data point and at other positions symmetrically distributed 
about the median but not equally spaced. 

To place a prespecified number of knots, let $x_{(i)}$ denote the ith order statistic 
of the data, for $i = 1, \ldots, n$, so $x_{(1)}$ is the minimum observed value. Define an 
approximate quantile function 

\[
q\!\left(\frac{i-1}{n-1}\right) = x_{(i)}
\]

for $1 \le i \le n$, where the value of q is obtained by linear interpolation for 
noninteger i. 

The M knots will be placed at $x_{(1)}$, at $x_{(n)}$, and at the positions of the order 
statistics indexed by $q(r_2), \ldots, q(r_{M-1})$ for a sequence of numbers 
$0 < r_2 < r_3 < \cdots < r_{M-1} < 1$. 

When $(L, U) = (-\infty, \infty)$, placement of the interior knots is governed by the 
following constraint on between-knot distances: 

\[
n(r_{i+1} - r_i) = 4 \cdot \max\{4 - \epsilon, 1\} \cdot \max\{4 - 2\epsilon, 1\} \cdots \max\{4 - (i-1)\epsilon, 1\}
\]

for $1 \le i \le M/2$, where $r_1 = 0$ and $\epsilon$ is chosen to satisfy $r_{(M+1)/2} = 1/2$ if M is odd, or 
$r_{M/2} + r_{M/2+1} = 1$ if M is even. The remaining knots are placed to maintain quantile 
symmetry, so that 

\[
r_{M+1-i} - r_{M-i} = r_{i+1} - r_i \tag{10.38}
\]

for $M/2 \le i \le M - 1$, where $r_M = 1$. 

When (L, U) is not a doubly infinite interval, similar knot placement rules have 
been suggested. In particular, if (L, U) is an interval of finite length, then $r_2, \ldots, r_{M-1}$ 
are chosen equally spaced, so $r_i = (i - 1)/(M - 1)$. 
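For the doubly infinite case, the placement rule reduces to a one-dimensional root-finding problem in $\epsilon$. The sketch below reflects our reading of the constraint (the product is empty for i = 1, so the first gap is 4/n) and uses SciPy's `brentq`; the function names are illustrative only. The knots themselves are then read off with `numpy.quantile`, whose default linear interpolation matches the approximate quantile function q.

```python
import numpy as np
from scipy.optimize import brentq


def lower_half(eps, n, n_gaps):
    """Return r_1, ..., r_{n_gaps + 1} implied by the gap constraint, with r_1 = 0."""
    r = [0.0]
    gap = 4.0                              # n * (r_2 - r_1) = 4
    for i in range(1, n_gaps + 1):
        r.append(r[-1] + gap / n)
        gap *= max(4.0 - i * eps, 1.0)     # extend the product by one factor
    return np.array(r)


def knot_quantiles(n, M):
    """Quantile levels r_1 = 0 < r_2 < ... < r_M = 1 for M knots and n observations."""
    if M < 4:
        raise ValueError("the gap constraint only involves eps when M >= 4")
    n_gaps = M // 2

    def excess(eps):
        r = lower_half(eps, n, n_gaps)
        if M % 2 == 1:
            return r[-1] - 0.5             # want r_{(M+1)/2} = 1/2
        return r[-2] + r[-1] - 1.0         # want r_{M/2} + r_{M/2+1} = 1

    hi = 3.0                               # every max{.} factor equals 1 here
    if excess(hi) > 0:
        raise ValueError("n is too small for this many knots")
    lo = -1.0
    while excess(lo) <= 0:                 # widen the bracket until the sign changes
        lo *= 2.0
    eps = brentq(excess, lo, hi)

    m = (M + 1) // 2
    half = lower_half(eps, n, n_gaps)[:m]
    r = np.empty(M)
    r[:m] = half
    r[M - m:] = 1.0 - half[::-1]           # quantile symmetry, as in (10.38)
    return r


def place_knots(x, M):
    """Knot locations: the approximate quantile function q evaluated at each r_i."""
    x = np.asarray(x, dtype=float)
    return np.quantile(x, knot_quantiles(len(x), M))
```

The gaps between consecutive $r_i$ grow from the tails toward the median, so knots are denser in the tails on the quantile scale, as the rule intends.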

The preceding paragraphs assumed that M, the number of knots, was prespec- 
ified. A variety of methods for choosing M are possible, but methods for choosing 
the number of knots have evolved to the point where a complete description of the 
recommended strategy is beyond the scope of our discussion. Roughly, the process is 
as follows. Begin by placing a small number of knots in the positions given above. The 
suggested minimum number is the first integer exceeding $\min\{2.5 n^{1/5}, n/4, n^*, 25\}$, 
where $n^*$ is the number of distinct data points. Additional knots are then added to 
the existing set, one at a time. At each iteration, a single knot is added in a position 
that gives the largest value of the Rao statistic for testing that the model without that 
knot suffices [391, 615]. Without examining significance levels, this process continues 
until the total number of knots reaches $\min\{4 n^{1/5}, n/4, n^*, 30\}$, or until no new 
candidate knots can be placed due to constraints on the positions or nearness of knots. 

Next, single knots are sequentially deleted. The deletion of a single knot corresponds 
to the removal of one basis function. Let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_{M-1})$ denote the 
maximum likelihood estimate of the parameters of the current model. Then the Wald 
statistic for testing the significance of the contribution of the ith basis function is 
$\hat{\theta}_i / \mathrm{SE}\{\hat{\theta}_i\}$, where $\mathrm{SE}\{\hat{\theta}_i\}$ is the square root of the ith diagonal entry of $-l''(\hat{\theta})^{-1}$, the 
inverted observed information matrix [391, 615]. The knot whose removal would 
yield the smallest Wald statistic value is deleted. Sequential deletion is continued 
until only about three knots remain. 
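The deletion criterion can be sketched in a few lines once the observed information matrix is available. The helper names below are hypothetical, and comparing Wald statistics by absolute value is our assumption about what "smallest" means here.

```python
import numpy as np


def wald_statistics(theta_hat, observed_information):
    """theta_i / SE(theta_i), with SEs from the inverse observed information."""
    cov = np.linalg.inv(np.asarray(observed_information, dtype=float))
    return np.asarray(theta_hat, dtype=float) / np.sqrt(np.diag(cov))


def deletion_candidate(theta_hat, observed_information):
    """Index of the basis function with the smallest absolute Wald statistic."""
    stats = wald_statistics(theta_hat, observed_information)
    return int(np.argmin(np.abs(stats)))
```

With `theta_hat = [2.0, -0.1, 1.0]` and a diagonal information matrix `diag(4, 1, 1)`, the standard errors are 0.5, 1, and 1, so the second basis function is the deletion candidate.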

FIGURE 10.7 Logspline density estimate (solid line) from bowhead whale calf migration 
data in Example 10.6. Below the histogram are dots indicating where knots were used (solid) 
and where knots were considered but rejected (hollow). Two other logspline density estimates 
for other knot choices are shown with the dotted and dashed lines; see the text for details. 

Sequential knot addition followed by sequential knot deletion generates a sequence 
of S models, with varying numbers of knots. Denote the number of knots in 
the sth model by $m_s$, for $s = 1, \ldots, S$. To choose the best model in the sequence, let 

\[
\mathrm{BIC}(s) = -2\, l(\hat{\theta}_s | x_1, \ldots, x_n) + (m_s - 1) \log n \tag{10.39}
\]

measure the quality of the sth model having corresponding MLE parameter 
vector $\hat{\theta}_s$. The quantity BIC(s) is a Bayes information criterion for model 
comparison [365, 579]; other measures of model quality can also be motivated. 
Selection of the model with the minimal value of BIC(s) among those in the model 
sequence provides the chosen number of knots. 
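The bookkeeping for the stepwise search is simple to express. The sketch below computes the suggested starting and stopping numbers of knots and selects the model minimizing (10.39); the function names are ours, and rounding the stopping value down to an integer is an assumption.

```python
import math


def knot_count_bounds(n, n_distinct):
    """Starting and stopping numbers of knots for stepwise addition."""
    start = math.floor(min(2.5 * n ** 0.2, n / 4, n_distinct, 25)) + 1
    stop = math.floor(min(4 * n ** 0.2, n / 4, n_distinct, 30))
    return start, stop


def bic(loglik, n_knots, n):
    """BIC(s) from (10.39) for a model with n_knots knots fitted to n points."""
    return -2.0 * loglik + (n_knots - 1) * math.log(n)


def select_model(logliks, knot_counts, n):
    """Return the index of the model with minimal BIC, plus all BIC values."""
    scores = [bic(l, m, n) for l, m in zip(logliks, knot_counts)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Here `logliks` would hold the maximized log likelihoods of the S models visited during addition and deletion, in any consistent order with `knot_counts`.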

Additional details of the knot selection process are given in [391, 615]. Soft- 
ware to carry out logspline density estimation in the R language is available in [388]. 
Stepwise addition and deletion of knots is a greedy search strategy that is not guar- 
anteed to find the best collection of knots. Other search strategies are also effective, 
including MCMC strategies [305, 615]. 

The logspline approach is one of several effective methods for density estima- 
tion based on spline approximation; another is given in [285]. 


Example 10.6 (Whale Migration, Continued) Figure 10.7 shows the logspline 
density estimate (solid line) for the whale calf migration data from Example 10.2. 
Using the procedure outlined above, a model with seven knots was selected. The 
locations of these seven knots are shown with the solid dots in the figure. During 
initial knot placement, stepwise knot addition, and stepwise knot deletion, four other 
knots were considered at various stages but not used in the model finally selected 
according to the BIC criterion. These discarded knots are shown with hollow dots 
in the figure. The degree of smoothness seen in Figure 10.7 is typical of logspline 
estimates since splines are piecewise cubic and twice continuously differentiable. 
Estimation of local modes can sometimes be a problem if the knots are insuf- 
ficient in number or poorly placed. The other lines in Figure 10.7 show the logspline 
density estimates with two other choices for the knots. The very poor estimate (dashed 
line) was obtained using 6 knots. The other estimate (dotted line) was obtained using 
all 11 knots shown in the figure with either hollow or solid dots. 


10.4 MULTIVARIATE METHODS 


Multivariate density estimation of a density function f is based on i.i.d. random 
variables sampled from f. A p-dimensional variable is denoted $X_i = (X_{i1}, \ldots, X_{ip})$. 


10.4.1 The Nature of the Problem 


Multivariate density estimation is a significantly different task than univariate density 
estimation. It is very difficult to visualize any resulting density estimate when the 
region of support spans more than two or three dimensions. As an exploratory data 
analysis tool, multivariate density estimation therefore has diminishing usefulness 
unless some dimension-reduction strategy is employed. However, multivariate den- 
sity estimation is a useful component in many more elaborate statistical computing 
algorithms, where visualization of the estimate is not required. 

Multivariate density estimation is also hindered by the curse of dimension- 
ality. High-dimensional space is very different than 1, 2, or 3-dimensional space. 
Loosely speaking, high-dimensional space is vast, and points lying in such a space 
have very few near neighbors. To illustrate, Scott defined the tail region of a standard 
p-dimensional normal density to comprise all points at which the probability density 
is less than one percent of the density at the mode [581]. While only 0.2% of the 
probability density falls in this tail region when p = 1, more than half the density 
falls in it when p = 10, and 98% falls in it when p = 20. 
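These tail probabilities follow from a chi-square calculation: for a standard p-dimensional normal density, the density at z is below one percent of its modal value exactly when $z^T z > -2 \log(0.01)$, and $z^T z$ has a chi-square distribution with p degrees of freedom. A short sketch using SciPy (the function name is illustrative):

```python
import math
from scipy.stats import chi2


def tail_probability(p, fraction=0.01):
    """P(density at Z < fraction * density at mode) for Z ~ N_p(0, I)."""
    threshold = -2.0 * math.log(fraction)   # squared radius where the ratio equals fraction
    return chi2.sf(threshold, df=p)
```

Evaluating `tail_probability(p)` for p = 1, 10, and 20 gives values that can be compared with the percentages quoted above.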

The curse of dimensionality has important implications for density estimation. 
For example, consider a kernel density estimator based on a random sample of n points 
whose distribution is p-dimensional standard normal. Below we mention several 
ways to construct such an estimator; our choice here is the so-called product kernel 
approach with normal kernels sharing a common bandwidth, but it is not necessary 
to understand this technique yet to follow our argument. Define the optimal relative 
root mean squared error (ORRMSE) at the origin to be 


\[
\mathrm{ORRMSE}(p, n) = \frac{\sqrt{\min_h \{\mathrm{MSE}_h(\hat{f}(0))\}}}{f(0)},
\]

where $\hat{f}$ estimates f from a sample of n points using the best possible bandwidth. 
This quantity measures the quality of the multivariate density estimator at the true 
mode. When p = 1 and n = 30, ORRMSE(1, 30) = 0.1701. Table 10.2 shows the 
sample sizes required to achieve as low a value of ORRMSE(p, n) for different values 
of p. The sample sizes are shown to about three significant digits. A different band- 
width minimizes ORRMSE( p, n) for each different p and n, so the table entries were 
computed by fixing p and searching over n, with each trial value of n requiring an 
optimization over h. This table confirms that desirable sample sizes grow very rapidly 
with p. In practice, things are not as hopeless as Table 10.2 might suggest. Adequate 
estimates can sometimes be obtained with a variety of techniques, especially those 
that attempt to simplify the problem via dimension reduction. 

TABLE 10.2 Sample sizes required to match the optimal relative 
root mean squared error at the origin achieved for one-dimensional 
data when n = 30. These results pertain to estimation 
of a p-variate normal density using a normal product kernel 
density estimator with bandwidths that minimize the relative 
root mean squared error at the origin in each instance. 

 p     n
 1     30
 2     180
 3     806
 5     17,400
10     112,000,000
15     2,190,000,000,000
30     806,000,000,000,000,000,000,000,000
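Entries like those in Table 10.2 can be approximated with a short computation. The sketch below rests on our own derivation, not on a formula stated in the text: for standard normal data and a normal product kernel with common bandwidth h, the mean and variance of the estimator at the origin have closed forms, giving a relative bias of $(1 + h^2)^{-p/2} - 1$ and a relative variance of $[h^{-p}(2 + h^2)^{-p/2} - (1 + h^2)^{-p}]/n$. The search over n follows the procedure described above, using SciPy's `minimize_scalar` and `brentq`; all function names are hypothetical.

```python
import math
from scipy.optimize import brentq, minimize_scalar


def relative_mse(h, p, n):
    """MSE of the estimate at the origin divided by f(0)^2 (closed form, normal case)."""
    bias = (1 + h**2) ** (-p / 2) - 1
    var = (h ** (-p) * (2 + h**2) ** (-p / 2) - (1 + h**2) ** (-p)) / n
    return bias**2 + var


def orrmse(p, n):
    """Optimal relative root MSE at the origin: minimize over the bandwidth h."""
    res = minimize_scalar(relative_mse, bounds=(1e-3, 5.0), args=(p, n), method="bounded")
    return math.sqrt(res.fun)


def required_n(p, target):
    """Sample size at which ORRMSE(p, n) falls to the target value."""
    f = lambda log_n: orrmse(p, math.exp(log_n)) - target
    return math.exp(brentq(f, math.log(2.0), math.log(1e40)))
```

Setting `target = orrmse(1, 30)` and calling `required_n(p, target)` for increasing p shows the rapid growth in sample size; small discrepancies from the table may arise from rounding and from the tolerance of the numerical searches.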


10.4.2 Multivariate Kernel Estimators 


The most literal generalization of the univariate kernel density estimator in (10.6) 
to the case of p-dimensional density estimation is the general multivariate kernel 
estimator 

\[
\hat{f}(x) = \frac{1}{n|H|} \sum_{i=1}^{n} K\!\left(H^{-1}(x - X_i)\right), \tag{10.40}
\]

where H is a $p \times p$ nonsingular matrix of constants, whose absolute determinant is 
denoted |H|. The function K is a real-valued multivariate kernel function for which 
$\int K(z)\,dz = 1$, $\int z K(z)\,dz = 0$, and $\int z z^T K(z)\,dz = I_p$, where $I_p$ is the $p \times p$ identity 
matrix. 

This estimator is quite a bit more flexible than usually is required. It allows both 
a p-dimensional kernel of arbitrary shape and an arbitrary linear rotation and scaling 
via H. It can be quite inconvenient to try to specify the large number of bandwidth 
parameters contained in H and to specify a kernel shape over p-dimensional space. It 
is more practical to resort to more specialized forms of H and K that have far fewer 
parameters. 

The product kernel approach provides a great deal of simplification. The density 
estimator is 


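\[
\hat{f}(x) = \frac{1}{n \prod_{j=1}^{p} h_j} \sum_{i=1}^{n} \prod_{j=1}^{p} K\!\left(\frac{x_j - X_{ij}}{h_j}\right), \tag{10.41}
\]

where K(z) is a univariate kernel function, $x = (x_1, \ldots, x_p)$, $X_i = (X_{i1}, \ldots, X_{ip})$, 
and the $h_j$ are fixed bandwidths for each coordinate, $j = 1, \ldots, p$. 
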
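The product kernel estimator in (10.41) vectorizes naturally. A minimal NumPy sketch, assuming a normal univariate kernel by default and with illustrative function names:

```python
import numpy as np


def normal_kernel(z):
    """Univariate standard normal density, applied elementwise."""
    return np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)


def product_kernel_density(x, data, bandwidths, kernel=normal_kernel):
    """Evaluate (10.41) at the rows of x.

    x is an (m, p) array of evaluation points, data is an (n, p) array of
    observations, and bandwidths holds one h_j per coordinate.
    """
    x = np.asarray(x, dtype=float)
    data = np.asarray(data, dtype=float)
    h = np.asarray(bandwidths, dtype=float)
    z = (x[:, None, :] - data[None, :, :]) / h      # shape (m, n, p)
    contrib = kernel(z).prod(axis=2) / h.prod()     # one term per (point, observation)
    return contrib.mean(axis=1)
```

The intermediate array has m × n × p entries, so for large problems the evaluation points would need to be processed in chunks.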
Another simplifying approach would be to allow K to be a radially symmetric, 
unimodal density function in p dimensions, and to set 

\[
\hat{f}(x) = \frac{1}{n h^p} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \tag{10.42}
\]

In this case, the multivariate Epanechnikov kernel shape 

\[
K(z) = \begin{cases} \dfrac{(p + 2)\,\Gamma(1 + p/2)}{2 \pi^{p/2}} \,(1 - z^T z) & \text{if } z^T z \le 1, \\ 0 & \text{otherwise} \end{cases} \tag{10.43}
\]


is optimal with respect to the asymptotic mean integrated squared error. However, as 
with univariate kernel density estimation, many other kernels provide nearly equiva- 
lent results. 

The single fixed bandwidth in (10.42) means that probability contributions 
associated with each observed data point will diffuse in all directions equally. When 
the data have very different variability in different directions, or when the data nearly 
lie on a lower-dimensional manifold, treating all dimensions as if they were on the 
same scale can lead to poor estimates. Fukunaga [211] suggested linearly transforming 
the data so that they have identity covariance matrix, then estimating the density of 
the transformed data using (10.42) with a radially symmetric kernel, and then back- 
transforming to obtain the final estimate. To carry out the transformation, compute 
an eigenvalue-eigenvector decomposition of the sample covariance matrix so $\hat{\Sigma} = P \Lambda P^T$, 
where $\Lambda$ is a $p \times p$ diagonal matrix with the eigenvalues in descending order 
and P is an orthonormal $p \times p$ matrix whose columns consist of the eigenvectors 
corresponding to the eigenvalues in $\Lambda$. Let $\bar{X}$ be the sample mean. Then setting 
$Z_i = \Lambda^{-1/2} P^T (X_i - \bar{X})$ for $i = 1, \ldots, n$ provides the transformed data. This process 
is commonly called whitening or sphering the data. Using the kernel density estimator 
in (10.42) on the transformed data is equivalent to using the density estimator 

\[
\hat{f}(x) = \frac{|\hat{\Sigma}|^{-1/2}}{n h^p} \sum_{i=1}^{n} K\!\left(\frac{(x - X_i)^T \hat{\Sigma}^{-1} (x - X_i)}{h^2}\right) \tag{10.44}
\]

on the original data, for a symmetric kernel K. 
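The whitening step and the equivalent estimator on the original scale can be sketched as follows. This is an illustration under our own assumptions: the kernel is taken to be the radially symmetric standard normal density, written as a function of the squared Mahalanobis distance, and the names `sphere` and `fukunaga_density` are hypothetical.

```python
import numpy as np


def sphere(data):
    """Whiten the rows of data; also return a function mapping new points to the z scale."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    eigvals, P = np.linalg.eigh(np.cov(data, rowvar=False))  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                        # reorder to descending
    eigvals, P = eigvals[order], P[:, order]

    def to_z(x):
        # Row-vector form of Lambda^{-1/2} P^T (x - mean).
        return (np.asarray(x, dtype=float) - mean) @ P / np.sqrt(eigvals)

    return to_z(data), to_z


def fukunaga_density(x, data, h):
    """Normal-kernel estimate at a single point x, computed on the original scale."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    cov = np.cov(data, rowvar=False)
    diff = np.asarray(x, dtype=float) - data
    q = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)  # squared Mahalanobis
    kern = np.exp(-0.5 * q / h**2) / (2 * np.pi) ** (p / 2)
    return kern.sum() / (n * h**p * np.sqrt(np.linalg.det(cov)))
```

Applying a fixed-bandwidth radially symmetric normal estimator to the output of `sphere` and dividing by $|\hat{\Sigma}|^{1/2}$ gives the same value as `fukunaga_density`, which is the equivalence stated above.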

Within the range of complexity presented by the choices above, the product 
kernel approach in (10.41) is usually preferred to (10.42) and (10.44), in view of 
its performance and flexibility. Using a product kernel also simplifies the numerical 
calculation and scaling of kernels. 

As with the univariate case, it is possible to derive an expression for the asymp- 
totic mean integrated squared error for a product kernel density estimator. The min- 
imizing bandwidths h1, ..., hp are the solutions to a set of p nonlinear equations. 
The optimal $h_i$ are all $O(n^{-1/(p+4)})$, and $\mathrm{AMISE}(h_1, \ldots, h_p) = O(n^{-4/(p+4)})$ for 
these optimal $h_i$. Bandwidth selection for product kernel density estimators and other 
multivariate approaches is far less well studied than in the univariate case. 

Perhaps the simplest approach to bandwidth selection in this case is to assume 
that f is normal, thereby simplifying the minimization of $\mathrm{AMISE}(h_1, \ldots, h_p)$ with 
respect to $h_1, \ldots, h_p$. This provides a bandwidth selection rationale akin to Silverman’s 
rule of thumb in the univariate case. The resulting bandwidths for the normal 
product kernel approach are 

\[
h_i = \left(\frac{4}{n(p + 2)}\right)^{1/(p+4)} \hat{\sigma}_i \tag{10.45}
\]

for $i = 1, \ldots, p$, where $\hat{\sigma}_i$ is an estimate of the standard deviation along the ith 
coordinate. As with the univariate case, using a robust scale estimator can improve 
performance. When nonnormal kernels are being used, the bandwidth for the normal 
kernel can be rescaled using (10.32) and Table 10.1 to provide an analogous 
bandwidth for the chosen kernel. 
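The rule in (10.45) is a one-line computation. In this sketch the scale estimate is the ordinary sample standard deviation, which is our choice for illustration; a robust estimator could be substituted, as noted above. The function name is hypothetical.

```python
import numpy as np


def normal_reference_bandwidths(data):
    """Coordinatewise bandwidths from (10.45) for an (n, p) data array."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    scale = data.std(axis=0, ddof=1)       # sample standard deviation per coordinate
    return (4.0 / (n * (p + 2))) ** (1.0 / (p + 4)) * scale
```

When p = 1 the multiplier reduces to $(4/(3n))^{1/5}$.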

Terrell’s maximal smoothing principle can also be applied to p-dimensional 
problems. Suppose we apply the general kernel density estimator given by (10.40) 
with a kernel function that is a density with identity covariance matrix. Then the 
maximal smoothing principle indicates choosing a bandwidth matrix H that satisfies 


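\[
H H^T = \left[ \frac{(p + 8)^{(p+6)/2}\, \pi^{p/2}\, R(K)}{16\, n\, (p + 2)\, \Gamma((p + 8)/2)} \right]^{2/(p+4)} \hat{\Sigma}, \tag{10.46}
\]

where $\hat{\Sigma}$ is the sample covariance matrix. One could apply this result to find the 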
maximal smoothing bandwidths for a normal product kernel, then rescale the coor- 
dinatewise bandwidths using (10.32) and Table 10.1 if another product kernel shape 
was desired. 
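The scalar multiplier in (10.46) and one bandwidth matrix satisfying the equation can be computed as below. This is a sketch: the equation only determines $HH^T$, so taking H to be the Cholesky factor is our arbitrary choice, and the function names are illustrative. For a p-dimensional standard normal kernel the roughness is $R(K) = (4\pi)^{-p/2}$.

```python
import math
import numpy as np


def maximal_smoothing_factor(n, p, roughness):
    """Scalar multiplying the sample covariance matrix in (10.46)."""
    num = (p + 8) ** ((p + 6) / 2) * math.pi ** (p / 2) * roughness
    den = 16 * n * (p + 2) * math.gamma((p + 8) / 2)
    return (num / den) ** (2 / (p + 4))


def maximal_smoothing_H(data, roughness):
    """One matrix H with H H^T equal to the right-hand side of (10.46)."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    return np.linalg.cholesky(maximal_smoothing_factor(n, p, roughness) * cov)
```

As a check, when p = 1 the square root of the factor equals $3[R(K)/(35n)]^{1/5}$, the usual form of the univariate maximal smoothing multiplier for a unit-variance kernel.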

Cross-validation methods can also be generalized to the multivariate case, as 
can some other automatic bandwidth selection procedures. However, the overall per- 
formance of such methods in general p-dimensional problems is not thoroughly 
documented. 


10.4.3 Adaptive Kernels and Nearest Neighbors 


With ordinary fixed-kernel density estimation, the shape of K and the bandwidth are 
fixed. These determine an unchanging notion of proximity. Weighted contributions 
from nearby $X_i$ determine $\hat{f}(x)$, where the weights are based on the proximity of $X_i$ 
to x. For example, with a uniform kernel, the estimate is based on variable numbers 
of observations within a sliding window of fixed shape. 

It is worthwhile to consider the opposite viewpoint: allowing regions to vary 
in size, but requiring them (in some sense) to have a fixed number of observations 
falling in them. Then regions of larger size correspond to areas of low density, and 
regions of small size correspond to areas of high density. 

It turns out that estimators derived from this principle can be written as kernel 
estimators with a changing bandwidth that adapts to the local density of observed data 
points. Such approaches are variously termed adaptive kernel estimators, variable- 
bandwidth kernel estimators, or variable-kernel estimators. Three particular strategies 
are reviewed below. 

The motivation for adaptive methods is that a fixed bandwidth may not be 
equally suitable everywhere. In regions where data are sparse, wider bandwidths can 
help prevent excessive local sensitivity to outliers. Conversely, where data are abun- 
dant, narrower bandwidths can prevent bias introduced by oversmoothing. Consider 
again the kernel density estimate of bowhead whale calf migration times given in 
Figure 10.5 using the fixed Sheather-Jones bandwidth. For migration times below 
1200 and above 1270 hours, the estimate exhibits a number of modes, yet it is unclear 
how many of these are true and how many are artifacts of sampling variability. It is not 
possible to increase the bandwidth sufficiently to smooth away some of the small local 
modes in the tails without also smoothing away the prominent bimodality between 
1200 and 1270. Only local changes to the bandwidth will permit such improvements. 

In theory, when p = 1 there is little to recommend adaptive methods over sim- 
pler approaches, but in practice some adaptive methods have been demonstrated to be 
quite effective in some examples. For moderate or large p, theoretical analysis sug- 
gests that the performance of adaptive methods can be excellent compared to standard 
kernel estimators, but the practical performance of adaptive approaches in such cases 
is not thoroughly understood. Some performance comparisons for adaptive methods 
can be found in [356, 581, 628]. 


10.4.3.1 Nearest Neighbor Approaches The kth nearest neighbor density 
estimator, 


\[
\hat{f}(x) = \frac{k}{n V_p\, d_k(x)^p}, \tag{10.47}
\]

was the first approach to explicitly adopt a variable-bandwidth viewpoint [423]. For 
this estimator, $d_k(x)$ is the Euclidean distance from x to the kth nearest observed 
data point, and $V_p$ is the volume of the unit sphere in p dimensions, where p is the 
dimensionality of the data. Since $V_p = \pi^{p/2}/\Gamma(p/2 + 1)$, note that $d_k(x)$ is the only 
random quantity in (10.47), as it depends on $X_1, \ldots, X_n$. Conceptually, the kth nearest 
neighbor estimate of the density at x is k/n divided by the volume of the smallest 
sphere centered at x that contains k of the n observed data values. The number of 
nearest neighbors, k, plays a role analogous to that of bandwidth: Large values of k 
yield smooth estimates, and small values of k yield wiggly estimates. 
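A direct implementation of (10.47) needs only a sort of distances. The following sketch uses a brute-force search over all observations, which is adequate for illustration; the function names are ours.

```python
import math
import numpy as np


def unit_ball_volume(p):
    """Volume V_p of the unit sphere in p dimensions."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1)


def knn_density(x, data, k):
    """kth nearest neighbor density estimate (10.47) at a single point x."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    dist = np.sort(np.linalg.norm(data - np.asarray(x, dtype=float), axis=1))
    d_k = dist[k - 1]                       # distance to the kth nearest observation
    return k / (n * unit_ball_volume(p) * d_k**p)
```

For the one-dimensional data 0, 1, 2, 3 with k = 2, the estimate at x = 0 is 2/(4 × 2 × 1) = 0.25, since $V_1 = 2$ and the second nearest observation is at distance 1.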

The estimator (10.47) may be viewed as a kernel estimator with a bandwidth 
that varies with x and a kernel equal to the density function that is uniform on the 
unit sphere in p dimensions. For an arbitrary kernel, the nearest neighbor estimator 
can be written as 

\[
\hat{f}(x) = \frac{1}{n\, d_k(x)^p} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{d_k(x)}\right). \tag{10.48}
\]

If $d_k(x)$ is replaced with an arbitrary function $h_k(x)$, which may not explicitly represent 
distance, the name balloon estimator has been suggested because the bandwidth 
inflates or deflates through a function that depends on x [628]. The nearest neighbor 
estimator is asymptotically of this type: For example, using $d_k(x)$ as the bandwidth for 
the uniform-kernel nearest neighbor estimator is asymptotically equivalent to using 
a balloon estimator bandwidth of $h_k(x) = [k/(n V_p f(x))]^{1/p}$, since 

\[
\frac{k}{n V_p\, d_k(x)^p} \to f(x)
\]

as $n \to \infty$, $k \to \infty$, and $k/n \to 0$. 

Nearest neighbor and balloon estimators exhibit a number of surprising 
attributes. First, choosing K to be a density does not ensure that $\hat{f}$ is a density; for 
instance, the estimator in (10.47) does not have a finite integral. Second, when p = 1 
and K is a density with zero mean and unit variance, choosing $h_k(x) = k/[2 n f(x)]$ 
does not offer any asymptotic improvement relative to a standard kernel estimator, 
regardless of the choice of k [581]. Finally, one can show that the pointwise asymptotic 
mean squared error of a univariate balloon estimator is minimized when 

\[
h_k(x) = h(x) = \left( \frac{f(x)\, R(K)}{n\, [f''(x)]^2} \right)^{1/5}.
\]

Even with this optimal pointwise adaptive bandwidth, however, the asymptotic 
efficiency of univariate balloon estimators does not greatly exceed that of ordinary 
fixed-bandwidth kernel estimators when f is roughly symmetric and unimodal [628]. 
Thus, nearest neighbor and balloon estimators seem a poor choice when p = 1. 

On the other hand, for multivariate data, balloon estimators offer much more 
promise. The asymptotic efficiency of the balloon estimator can greatly surpass that 
of a standard multivariate kernel estimator, even for fairly small p and symmetric, 
unimodal data [628]. If we further generalize (10.48) to 

\[
\hat{f}(x) = \frac{1}{n\, |H(x)|} \sum_{i=1}^{n} K\!\left(H(x)^{-1}(x - X_i)\right), \tag{10.49}
\]

where H(x) is a bandwidth matrix that varies with x, then we have effectively allowed 
the shape of kernel contributions to vary with x. When $H(x) = h_k(x) I$, the general 
form reverts to the balloon estimator. Further, setting $h_k(x) = d_k(x)$ yields the nearest 
neighbor estimator in (10.48). More general choices for H(x) are mentioned in [628]. 


10.4.3.2 Variable-Kernel Approaches and Transformations A variable- 
kernel or sample point adaptive estimator can be written as 


\[
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i^p} K\!\left(\frac{x - X_i}{h_i}\right), \tag{10.50}
\]

where K is a multivariate kernel and $h_i$ is a bandwidth individualized for the kernel 
contribution centered at $X_i$ [66]. For example, $h_i$ might be set equal to the distance 
from $X_i$ to the kth nearest other observed data point, so $h_i = d_k(X_i)$. A more general 
variable kernel estimator with bandwidth matrix $H_i$ that depends on the ith sampled 
point is also possible [cf. Equation (10.49)], but we focus on the simpler form here. 

The variable-kernel estimator in (10.50) is a mixture of kernels with identical 
shape but different scales, centered at each observation. Letting the bandwidth vary as 
a function of $X_i$ rather than of x guarantees that $\hat{f}$ is a density whenever K is a density. 

Optimal bandwidths for variable-kernel approaches depend on f. Pilot estima- 
tion of f can be used to guide bandwidth adaptation. Consider the following general 
strategy: 


1. Construct a pilot estimator $\tilde{f}(x)$ that is strictly positive for all observed $x_i$. 
Pilot estimation might employ, for example, a normal product kernel density 
estimator with bandwidth chosen according to (10.45). If $\tilde{f}$ is based 
on an estimator that may equal or approach zero at some $x_i$, then let $\tilde{f}(x)$ 
equal the estimated density whenever the estimate exceeds $\epsilon$, and let $\tilde{f}(x) = \epsilon$ 
otherwise. The choice of an arbitrary small constant $\epsilon > 0$ improves performance 
by providing an upper bound for adaptively chosen bandwidths. 


2. Let the adaptive bandwidth be $h_i = h/\tilde{f}(X_i)^{\alpha}$, for a sensitivity parameter 
$0 \le \alpha \le 1$. The parameter h assumes the role of a bandwidth parameter that 
can be adjusted to control the overall smoothness of the final estimate. 


3. Apply the variable-kernel estimator in (10.50) with the bandwidths $h_i$ found in 
step 2 to produce the final estimate. 


FIGURE 10.8 Results from Example 10.7. From left to right, the three panels show the 
bivariate density value along the one-dimensional slice for which $x_2 = 0$ for: the true bivariate 
t distribution with two degrees of freedom, the bivariate estimate using a fixed-bandwidth 
product kernel approach, and the bivariate estimate using Abramson’s adaptive approach as 
described in the text. 


The parameter $\alpha$ affects the degree of local adaptation by controlling how 
quickly the bandwidth changes in response to suspected changes in f. Asymptotic 
arguments and practical experience support setting $\alpha = 1/2$, which yields the approach 
of Abramson [3]. Several investigators have found this approach to perform well in 
practice [598, 674]. 
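The three-step strategy can be sketched compactly for a normal product kernel. The code below is an illustration under several assumptions of ours: the pilot uses the bandwidths from (10.45) with the ordinary sample standard deviation, the default floor `eps` is arbitrary, and the overall bandwidth h is supplied by the user. The function names are hypothetical.

```python
import numpy as np


def normal_product_kernel(z):
    """Product of standard normal densities over the last axis of z."""
    p = z.shape[-1]
    return np.exp(-0.5 * (z**2).sum(axis=-1)) / (2 * np.pi) ** (p / 2)


def fixed_kde(x, data, h):
    """Fixed-bandwidth normal product kernel estimate; h has one entry per coordinate."""
    z = (x[:, None, :] - data[None, :, :]) / h
    return normal_product_kernel(z).mean(axis=1) / np.prod(h)


def abramson_kde(x, data, h, alpha=0.5, eps=1e-3):
    """Variable-kernel estimate (10.50) at the rows of x, with h_i = h / pilot(X_i)^alpha."""
    n, p = data.shape
    # Step 1: pilot estimate at the observations, floored at eps.
    h0 = (4.0 / (n * (p + 2))) ** (1.0 / (p + 4)) * data.std(axis=0, ddof=1)
    pilot = np.maximum(fixed_kde(data, data, h0), eps)
    # Step 2: one bandwidth per observation.
    h_i = h / pilot**alpha
    # Step 3: mixture of kernels with individual scales.
    z = (x[:, None, :] - data[None, :, :]) / h_i[None, :, None]
    weights = normal_product_kernel(z) / h_i[None, :] ** p
    return weights.mean(axis=1), h_i
```

The second return value exposes the individual bandwidths, which makes it easy to confirm that observations in sparse regions receive wider kernels than observations near the mode.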

An alternative proposal is $\alpha = 1/p$, which yields an approach that is asymptotically 
equivalent to the adaptive kernel estimator of Breiman, Meisel, and Purcell [66]. 
This choice ensures that the number of observed data points captured by the scaled 
kernel will be roughly equal everywhere [598]. In their algorithm, these authors used 
a nearest neighbor approach for $\tilde{f}$ and set $h_i = h\, d_k(X_i)$ for a smoothness parameter 
h that may depend on k. 


Example 10.7 (Bivariate t Distribution) To illustrate the potential benefit of 
adaptive approaches, consider estimating the bivariate t distribution (with two degrees 
of freedom) from a sample of size n = 500. In the nonadaptive approach, we use a 
normal product kernel with individual bandwidths chosen using the Sheather-Jones 
approach. As an adaptive alternative, we use Abramson’s variable-kernel approach 
($\alpha = 1/2$) with a normal product kernel, the pilot estimate taken to be the result of 
the nonadaptive approach, $\epsilon = 0.005$, and h set equal to the mean of the coordinatewise 
bandwidths used in the nonadaptive approach times the geometric mean of the 
$\tilde{f}(X_i)^{1/2}$. 

The left panel of Figure 10.8 shows the true values of the bivariate $t$ distribution
with two degrees of freedom, $f$, along the line $x_2 = 0$. In other words, this shows a
slice from the true density. The center panel of Figure 10.8 shows the result of the
nonadaptive approach. The tails of the estimate exhibit undesirable wiggliness caused 
by an inappropriately narrow bandwidth in the tail regions, where a few outliers 
fall. The right panel of Figure 10.8 shows the result from Abramson’s approach. 
Bandwidths are substantially wider in the tail areas, thereby producing smoother 
estimates in these regions than were obtained from the fixed-bandwidth approach. 
Abramson’s method also uses much narrower bandwidths near the estimated mode. 
There is a slight indication of this for our random sample, but the effect can sometimes 
be pronounced. 


Having discussed the variable-kernel approach, emphasizing its application in 
higher dimensions, we next consider a related approach primarily used for univariate 
data. This method illustrates the potential advantage of data transformation for density 
estimation. 

Wand, Marron, and Ruppert noted that conducting fixed-bandwidth kernel den- 
sity estimation on data that have been nonlinearly transformed is equivalent to using 
a variable-bandwidth kernel estimator on the original data [652]. The transformation
induces separate bandwidths $h_i$ at each data point.


Suppose univariate data $X_1, \ldots, X_n$ are observed from a density $f_X$. Let

\[ t_\lambda(x) = \frac{\sigma_X \, t^*_\lambda(x)}{\sigma_{t^*_\lambda(X)}} \tag{10.51} \]

denote a transformation, where $t^*_\lambda$ is a monotonic increasing mapping of the support
of $f_X$ to the real line parameterized by $\lambda$, and $\sigma_X^2$ and $\sigma^2_{t^*_\lambda(X)}$ are the variances of $X$ and
$t^*_\lambda(X)$, respectively. Then $t_\lambda$ is a scale-preserving transformation that maps the
random variable $X \sim f_X$ to $Y = t_\lambda(X)$ having density

\[ g_\lambda(y) = f_X\bigl(t_\lambda^{-1}(y)\bigr) \left| \frac{d}{dy} t_\lambda^{-1}(y) \right|. \tag{10.52} \]

For example, if $X$ is a standard normal random variable and $t^*_\lambda(X) = \exp\{X\}$, then
$Y$ has the same variance as $X$. However, a window of fixed width 0.3 on the $Y$ scale
centered at any value $y$ has variable width when back-transformed to the $X$ scale: The
width is roughly 2.76 when $x = -1$ but only 0.24 when $x = 1$. In practice, sample
standard deviations or robust measures of spread may be used in $t_\lambda$ to preserve scale.
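The window widths quoted in this example can be checked numerically. The sketch below assumes $t_\lambda(x) = \exp\{x\}/\sigma_{\exp\{X\}}$ with $\sigma_X = 1$, and uses the lognormal variance $(e - 1)e$ for $\exp\{X\}$ when $X \sim N(0,1)$; the function names are hypothetical.

```python
import math

# Standard deviation of exp{X} for X ~ N(0, 1): sqrt((e - 1) * e).
SD_EXP = math.sqrt((math.e - 1.0) * math.e)

def t_lam(x):
    """Scale-preserving transformation: exp{x} rescaled to unit variance."""
    return math.exp(x) / SD_EXP

def t_lam_inv(y):
    """Inverse transformation back to the X scale."""
    return math.log(SD_EXP * y)

def back_transformed_width(x, width=0.3):
    """Width on the X scale of a window of the given width centered at t_lam(x)."""
    y = t_lam(x)
    return t_lam_inv(y + width / 2.0) - t_lam_inv(y - width / 2.0)

print(round(back_transformed_width(-1.0), 2))  # 2.76
print(round(back_transformed_width(1.0), 2))   # 0.24
```

The same fixed window on the transformed scale therefore acts as a wide bandwidth in the left tail of $X$ and a narrow one on the right.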

Suppose we transform the data using $t_\lambda$ to obtain $Y_1, \ldots, Y_n$, then construct a
fixed-bandwidth kernel density estimate for these transformed data, and then
back-transform the resulting estimate to the original scale to produce an estimate of $f_X$.
From (10.18) we know that the bandwidth that minimizes AMISE($h$) for a kernel
estimate of $g_\lambda$ is

\[ h_\lambda = \left( \frac{R(K)}{n \sigma_K^4 R(g_\lambda'')} \right)^{1/5} \tag{10.53} \]


for a given choice of $\lambda$.

Since $h_\lambda$ depends on the unknown density $g_\lambda$, a plug-in method is suggested to
estimate $R(g_\lambda'')$ by $\hat{R}(g_\lambda'') = R(\hat{g}'')$, where $\hat{g}$ is a kernel estimator using pilot bandwidth
$h_0$. Wand, Marron, and Ruppert suggest using a normal kernel with Silverman’s rule
of thumb to determine $h_0$, thereby yielding the estimator

\[ \hat{R}(g_\lambda'') = \frac{1}{n(n-1)h_0^5} \sum_{i \neq j} \phi^{(4)}\!\left( \frac{Y_i - Y_j}{h_0} \right), \tag{10.54} \]

where $h_0 = \sqrt{2}\,\hat{\sigma}_X [84\sqrt{\pi}/(5n^2)]^{1/13}$ and $\phi^{(4)}$ is the fourth derivative of the standard
normal density [652]. Since $t_\lambda$ is scale-preserving, the sample standard deviation of
$X_1, \ldots, X_n$, say $\hat{\sigma}_X$, provides an estimate of the standard deviation of $Y$ to use in the
expression for $h_0$. Related derivative estimation ideas are discussed in [298, 581].
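The plug-in calculation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it takes the sum in (10.54) over pairs $i \neq j$ with normalizing constant $1/[n(n-1)h_0^5]$, uses the identity $\phi^{(4)}(u) = (u^4 - 6u^2 + 3)\phi(u)$, and for the normal kernel sets $R(K) = 1/(2\sqrt{\pi})$ and $\sigma_K = 1$ in (10.53). All function names are hypothetical.

```python
import numpy as np

def phi4(u):
    """Fourth derivative of the standard normal density."""
    return (u**4 - 6.0 * u**2 + 3.0) * np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def pilot_bandwidth(x):
    """Pilot bandwidth h0 = sqrt(2) * sd * [84 sqrt(pi) / (5 n^2)]^(1/13)."""
    n = x.size
    return np.sqrt(2.0) * x.std(ddof=1) * (84.0 * np.sqrt(np.pi) / (5.0 * n**2)) ** (1.0 / 13.0)

def roughness_estimate(y, h0):
    """Estimate R(g'') from pairwise differences, omitting the i == j terms."""
    n = y.size
    d = (y[:, None] - y[None, :]) / h0
    total = phi4(d).sum() - n * phi4(0.0)
    return total / (n * (n - 1) * h0**5)

def plug_in_bandwidth(y, h0):
    """Bandwidth (10.53) for a normal kernel with R(g'') replaced by its estimate."""
    r_kernel = 1.0 / (2.0 * np.sqrt(np.pi))
    return (r_kernel / (y.size * roughness_estimate(y, h0))) ** 0.2
```

For standard normal data the true roughness is $3/(8\sqrt{\pi}) \approx 0.21$; the estimate is typically somewhat smaller because the pilot bandwidth smooths the second derivative.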
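The familiar Box–Cox transformation [57],

\[ t^*_\lambda(x) = \begin{cases} (x^\lambda - 1)/\lambda & \text{if } \lambda \neq 0, \\ \log x & \text{if } \lambda = 0, \end{cases} \tag{10.55} \]
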
is among the parameterized transformation families available for (10.51). When any 
good transformation will suffice, or in multivariate settings, it can be useful to rely 
upon the notion that the transformation should make the data more nearly symmetric 
and unimodal because fixed-bandwidth kernel density estimation is known to perform 
well in this case. 
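A short sketch of the Box–Cox family combined with the scale-preserving rescaling of (10.51) is given below. It assumes positive data and uses sample standard deviations in place of the population values; `box_cox` and `scale_preserving` are hypothetical names.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation: (x^lam - 1)/lam for lam != 0, log x for lam == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def scale_preserving(x, lam):
    """Rescale the Box-Cox transform so its sample sd matches that of x."""
    x = np.asarray(x, dtype=float)
    t = box_cox(x, lam)
    return x.std(ddof=1) * t / t.std(ddof=1)
```

Because the rescaled data have the same spread as the original data, a pilot bandwidth computed from the sample standard deviation of the $X_i$ can be reused on the transformed scale.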

This transformation approach to variable-kernel density estimation can work 
well for univariate skewed unimodal densities. Extensions to multivariate data are 
challenging, and applications to multimodal data can result in poor estimates. With- 
out all the formalism outlined above, data analysts routinely transform variables to 
convenient scales using functions such as the log, often retaining this transforma- 
tion thereafter for displaying results and even making inferences. When inferences 
on the original scale are preferred, one could pursue a transformation strategy based 
on graphical or quantitative assessments of the symmetry and unimodality achieved, 
rather than optimizing the transformation within a class of functions as described 
above. 


10.4.4 Exploratory Projection Pursuit 


Exploratory projection pursuit focuses on discovering low-dimensional structure in 
a high-dimensional density. The final density estimate is constructed by modifying a 
standard multivariate normal distribution to reflect the structure found. The approach 
described below follows Friedman [206], which extends previous work [210, 338]. 
In this subsection, reference will be made to a variety of density functions 
with assorted arguments. For notational clarity, we therefore assign a subscript to the 
density function to identify the random variable whose density is being discussed. 
Let the data consist of $n$ observations of $p$-dimensional variables, $X_1, \ldots, X_n \sim$
i.i.d. $f_X$. Before beginning exploratory projection pursuit, the data are transformed
to have mean $0$ and variance–covariance matrix $I_p$. This is accomplished using the
whitening or sphering transformation described in Section 10.4.2. Let $f_Z$ denote the
density corresponding to the transformed variables, $Z_1, \ldots, Z_n$. Both $f_Z$ and $f_X$ are
unknown. To estimate $f_X$ it suffices to estimate $f_Z$ and then reverse the transformation
to obtain an estimate of $f_X$. Thus our primary concern will be the estimation of $f_Z$.

Several steps in the process rely on another density estimation technique, based
on Legendre polynomial expansion. The Legendre polynomials are a sequence of
orthogonal polynomials on $[-1, 1]$ defined by $P_0(u) = 1$, $P_1(u) = u$, and $P_j(u) =
[(2j - 1) u P_{j-1}(u) - (j - 1) P_{j-2}(u)] / j$ for $j \ge 2$, having the property that the squared $L_2$
norm $\int_{-1}^{1} P_j^2(u)\,du = 2/(2j + 1)$ for all $j$ [2, 568]. These polynomials can be used
as a basis for representing functions on $[-1, 1]$. In particular, we can represent a
univariate density $f$ that has support only on $[-1, 1]$ by its Legendre polynomial
expansion


\[ f(x) = \sum_{j=0}^{\infty} a_j P_j(x), \tag{10.56} \]

where

\[ a_j = \frac{2j+1}{2} E\{P_j(X)\} \tag{10.57} \]

and the expectation in (10.57) is taken with respect to $f$. Equation (10.57)
can be confirmed by noting the orthogonality and $L_2$ norm of the $P_j$. If we
observe $X_1, \ldots, X_n \sim$ i.i.d. $f$, then $(1/n) \sum_{i=1}^n P_j(X_i)$ is an estimator of $E\{P_j(X)\}$.
Therefore

\[ \hat{a}_j = \frac{2j+1}{2n} \sum_{i=1}^n P_j(X_i) \tag{10.58} \]

may be used as estimates of the coefficients in the Legendre expansion of $f$. Truncating
the sum in (10.56) after $J + 1$ terms suggests the estimator

\[ \hat{f}(x) = \sum_{j=0}^{J} \hat{a}_j P_j(x). \tag{10.59} \]
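The coefficient estimates (10.58) and the truncated expansion (10.59) translate directly into a few lines of code. The sketch below relies on NumPy's Legendre utilities (`legvander` for the matrix of $P_j(x_i)$ values and `legval` for evaluating a Legendre series); the wrapper names are hypothetical, and the data are assumed to lie in $[-1, 1]$.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_coefficients(x, J):
    """Estimated coefficients a_hat_j = (2j+1)/(2n) * sum_i P_j(x_i), j = 0..J."""
    P = legendre.legvander(np.asarray(x, dtype=float), J)  # P[i, j] = P_j(x_i)
    j = np.arange(J + 1)
    return (2 * j + 1) / 2.0 * P.mean(axis=0)

def legendre_density(u, a_hat):
    """Truncated Legendre expansion sum_j a_hat_j P_j(u) evaluated at u."""
    return legendre.legval(u, a_hat)
```

Since $P_0 \equiv 1$, the first coefficient is always $1/2$, so the estimate integrates to one over $[-1, 1]$ regardless of $J$; it is not guaranteed to be nonnegative.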


Having described this Legendre expansion approach, we can now move on to study 
exploratory projection pursuit. 

The first step of exploratory projection pursuit is a projection step. If $Y_i = \alpha^T Z_i$,
then we say that $Y_i$ is the one-dimensional projection of $Z_i$ in the direction $\alpha$. The goal
of the first step is to project the multivariate observations onto the one-dimensional
line for which the distribution of the projected data has the most structure.

The degree of structure in the projected data is measured as the amount of
departure from normality. Let $U(y) = 2\Phi(y) - 1$, where $\Phi$ is the standard normal
cumulative distribution function. If $Y \sim N(0, 1)$, then $U(Y) \sim \mathrm{Unif}(-1, 1)$. To
measure the structure in the distribution of $Y$ it suffices to measure the degree to which
the density of $U(Y)$ differs from $\mathrm{Unif}(-1, 1)$.

Define a structure index as

\[ S(\alpha) = \int_{-1}^{1} \left[ f_U(u) - \tfrac{1}{2} \right]^2 du = R(f_U) - \tfrac{1}{2}, \tag{10.60} \]

where $f_U$ is the probability density function of $U(\alpha^T Z)$ when $Z \sim f_Z$. When $S(\alpha)$ is
large, a large amount of nonnormal structure is present in the projected data. When
$S(\alpha)$ is nearly zero, the projected data are nearly normal. Note that $S(\alpha)$ depends on
$f_U$, which must be estimated.

To estimate $S(\alpha)$ from the observed data, use the Legendre expansion for $f_U$ to
reexpress $R(f_U)$ in (10.60) as

\[ R(f_U) = \sum_{j=0}^{\infty} \frac{2j+1}{2} \left[ E\{P_j(U)\} \right]^2, \tag{10.61} \]

where the expectations are taken with respect to $f_U$. Since $U(\alpha^T Z_1), \ldots, U(\alpha^T Z_n)$
represent draws from $f_U$, the expectations in (10.61) can be estimated by sample
moments. If we also truncate the sum in (10.61) at $J + 1$ terms, we obtain

\[ \hat{S}(\alpha) = \sum_{j=0}^{J} \frac{2j+1}{2} \left[ \frac{1}{n} \sum_{i=1}^{n} P_j\bigl(2\Phi(\alpha^T Z_i) - 1\bigr) \right]^2 - \frac{1}{2} \tag{10.62} \]

as an estimator of $S(\alpha)$.

Thus, to estimate the projection direction yielding the greatest nonnormal structure,
we maximize $\hat{S}(\alpha)$ with respect to $\alpha$, subject to the constraint that $\alpha^T\alpha = 1$.
Denote the resulting direction by $\hat{\alpha}_1$. Although $\hat{\alpha}_1$ is estimated from the data, we treat
it as a fixed quantity when discussing distributions of projections of random vectors
onto it. For example, let $f_{\hat{\alpha}_1^T Z}$ denote the univariate marginal density of $\hat{\alpha}_1^T Z$ when
$Z \sim f_Z$, treating $Z$ as random and $\hat{\alpha}_1$ as fixed.
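The projection step can be sketched for bivariate data, where the unit-norm constraint reduces the search to a single angle. The code below evaluates the estimated structure index (10.62) and maximizes it over a grid of directions; the grid search is a simplification chosen for illustration (a numerical optimizer would be used in higher dimensions), and all names are hypothetical.

```python
import numpy as np
from math import erf, sqrt
from numpy.polynomial import legendre

def std_normal_cdf(y):
    """Standard normal CDF applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(y, dtype=float) / sqrt(2.0)))

def structure_index(alpha, Z, J=3):
    """Estimated structure index S_hat(alpha) for sphered data Z (n x p)."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)          # enforce alpha' alpha = 1
    u = 2.0 * std_normal_cdf(Z @ alpha) - 1.0      # U(alpha' Z_i)
    p_bar = legendre.legvander(u, J).mean(axis=0)  # sample means of P_j(U_i)
    j = np.arange(J + 1)
    return float(np.sum((2 * j + 1) / 2.0 * p_bar**2) - 0.5)

def best_direction_2d(Z, J=3, n_angles=180):
    """Grid search over unit vectors (cos t, sin t), t in [0, pi)."""
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    dirs = np.column_stack([np.cos(angles), np.sin(angles)])
    scores = np.array([structure_index(a, Z, J) for a in dirs])
    return dirs[scores.argmax()], float(scores.max())
```

Only half the circle is searched because $\alpha$ and $-\alpha$ define the same line. The $j = 0$ term always contributes exactly $1/2$, so the estimated index is nonnegative.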

The second step of exploratory projection pursuit is a structure removal step.
The goal is to apply a transformation to $Z_1, \ldots, Z_n$ which makes the density of the
projection of $f_Z$ on $\hat{\alpha}_1$ a standard normal density, while leaving the distribution of a
projection along any orthogonal direction unchanged. To do this, let $A_1$ be an orthonormal
matrix with first row equal to $\hat{\alpha}_1^T$. Also, for observations from a random vector $V =
(V_1, \ldots, V_p)$, define the vector transformation $T(v) = (\Phi^{-1}(F_{V_1}(v_1)), v_2, \ldots, v_p)$,
where $F_{V_1}$ is the cumulative distribution function of the first element of $V$. Then letting

\[ Z_i^{(1)} = A_1^T T(A_1 Z_i) \tag{10.63} \]

for $i = 1, \ldots, n$ would achieve the desired transformation. The transformation in
(10.63) cannot be used directly to achieve the structure removal goal because it
depends on the cumulative distribution function corresponding to $f_{\hat{\alpha}_1^T Z}$. To get around
this problem, simply replace the cumulative distribution function with the corresponding
empirical distribution function of $\hat{\alpha}_1^T Z_1, \ldots, \hat{\alpha}_1^T Z_n$. An alternative replacement
is suggested in [340].

The $Z_i^{(1)}$ for $i = 1, \ldots, n$ may be viewed as a new dataset consisting of the
observed values of random variables $Z_1^{(1)}, \ldots, Z_n^{(1)}$ whose unknown distribution $f_{Z^{(1)}}$
depends on $f_Z$. There is an important relationship between the conditionals determined
by $f_{Z^{(1)}}$ and $f_Z$ given a projection onto $\hat{\alpha}_1$. Specifically, the conditional
distribution of $Z_i^{(1)}$ given $\hat{\alpha}_1^T Z_i^{(1)}$ equals the conditional distribution of $Z_i$ given $\hat{\alpha}_1^T Z_i$
because the structure removal step creating $Z_i^{(1)}$ leaves all coordinates of $A_1 Z_i$ except
the first unchanged. Therefore

\[ f_{Z^{(1)}}(z) = \frac{f_Z(z)\,\phi(\hat{\alpha}_1^T z)}{f_{\hat{\alpha}_1^T Z}(\hat{\alpha}_1^T z)}. \tag{10.64} \]

Equation (10.64) provides no immediate way to estimate $f_Z$, but iterating the entire
process described above will eventually prove fruitful.
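A structure removal step of the form (10.63) can be sketched as follows. Two choices here are assumptions made for the illustration rather than details taken from the text: the orthonormal matrix is completed with a QR factorization, and the empirical distribution function is evaluated as (rank − 0.5)/n so that $\Phi^{-1}$ remains finite at the largest observation. The function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def orthonormal_with_first_row(alpha):
    """Return an orthonormal matrix A whose first row is alpha / ||alpha||."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    M = np.column_stack([alpha, np.eye(alpha.size)])
    Q, _ = np.linalg.qr(M)            # first column of Q is +/- alpha
    if Q[:, 0] @ alpha < 0:
        Q[:, 0] = -Q[:, 0]
    return Q.T

def remove_structure(Z, alpha):
    """Map rows z_i of Z to A' T(A z_i), Gaussianizing the projection on alpha."""
    n = Z.shape[0]
    A = orthonormal_with_first_row(alpha)
    V = Z @ A.T                                   # rows are A z_i
    ranks = V[:, 0].argsort().argsort() + 1       # ranks 1..n of the projection
    F = (ranks - 0.5) / n                         # adjusted empirical CDF (assumption)
    inv_cdf = NormalDist().inv_cdf
    V[:, 0] = [inv_cdf(f) for f in F]             # first coordinate -> normal scores
    return V @ A                                  # rows are A' T(A z_i)
```

After the step, projections onto directions orthogonal to `alpha` are unchanged, while the projection onto `alpha` consists of normal scores.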

Suppose a second projection step is conducted. A new direction to project
the working variables $Z_1^{(1)}, \ldots, Z_n^{(1)}$ is sought to isolate the greatest amount of
one-dimensional structure. Finding this direction requires the calculation of a new structure
index based on the transformed sample $Z_1^{(1)}, \ldots, Z_n^{(1)}$, leading to the estimation of $\hat{\alpha}_2$
as the projection direction revealing greatest structure.

Taking a second structure removal step requires the reapplication of
equation (10.63) with a suitable matrix $A_2$, yielding new working variables $Z_1^{(2)}, \ldots, Z_n^{(2)}$.

Iterating the same conditional distribution argument as expressed in (10.64)
allows us to write the density from which the new working data arise as

\[ f_{Z^{(2)}}(z) = f_Z(z)\, \frac{\phi(\hat{\alpha}_1^T z)\,\phi(\hat{\alpha}_2^T z)}{f_{\hat{\alpha}_1^T Z}(\hat{\alpha}_1^T z)\, f_{\hat{\alpha}_2^T Z^{(1)}}(\hat{\alpha}_2^T z)}, \tag{10.65} \]

where $f_{\hat{\alpha}_2^T Z^{(1)}}$ is the marginal density of $\hat{\alpha}_2^T Z^{(1)}$ when $Z^{(1)} \sim f_{Z^{(1)}}$.

Suppose the projection and structure removal steps are iterated several additional
times. At some point, the identification and removal of structure will lead to
new variables whose distribution has little or no remaining structure. In other words,
their distribution will be approximately normal along any possible univariate projection.
At this point, iterations are stopped. Suppose that a total of $M$ iterations were
taken. Then (10.65) extends to give

\[ f_{Z^{(M)}}(z) = f_Z(z) \prod_{m=1}^{M} \frac{\phi(\hat{\alpha}_m^T z)}{f_{\hat{\alpha}_m^T Z^{(m-1)}}(\hat{\alpha}_m^T z)}, \tag{10.66} \]

where $f_{\hat{\alpha}_m^T Z^{(m-1)}}$ is the marginal density of $\hat{\alpha}_m^T Z^{(m-1)}$ when $Z^{(m-1)} \sim f_{Z^{(m-1)}}$, and
$Z^{(0)} \sim f_Z$.

Now, Equation (10.66) can be used to estimate $f_Z$ because, having eliminated
all structure from the distribution of our working variables $Z^{(M)}$, we may set $f_{Z^{(M)}}$
equal to a $p$-dimensional multivariate normal density, denoted $\phi_p$. Solving for $f_Z$
gives

\[ f_Z(z) = \phi_p(z) \prod_{m=1}^{M} \frac{f_{\hat{\alpha}_m^T Z^{(m-1)}}(\hat{\alpha}_m^T z)}{\phi(\hat{\alpha}_m^T z)}. \tag{10.67} \]


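Although this equation still depends on the unknown densities $f_{\hat{\alpha}_m^T Z^{(m-1)}}$, these
can be estimated using the Legendre approximation strategy. Note that if $U^{(m-1)} =
2\Phi(\hat{\alpha}_m^T Z^{(m-1)}) - 1$ for $Z^{(m-1)} \sim f_{Z^{(m-1)}}$, then

\[ f_{U^{(m-1)}}(u) = \frac{f_{\hat{\alpha}_m^T Z^{(m-1)}}\bigl(\Phi^{-1}((u + 1)/2)\bigr)}{2\phi\bigl(\Phi^{-1}((u + 1)/2)\bigr)}. \tag{10.68} \]

Use the Legendre expansion of $f_{U^{(m-1)}}$ and sample moments to estimate

\[ \hat{f}_{U^{(m-1)}}(u) = \sum_{j=0}^{J} \left\{ \frac{2j+1}{2} P_j(u) \frac{1}{n} \sum_{i=1}^{n} P_j\bigl(U_i^{(m-1)}\bigr) \right\} \tag{10.69} \]

from $U_1^{(m-1)}, \ldots, U_n^{(m-1)}$ derived from $Z_1^{(m-1)}, \ldots, Z_n^{(m-1)}$. After substituting
$\hat{f}_{U^{(m-1)}}$ for $f_{U^{(m-1)}}$ in (10.68) and isolating $f_{\hat{\alpha}_m^T Z^{(m-1)}}$, we obtain

\[ \hat{f}_{\hat{\alpha}_m^T Z^{(m-1)}}(\hat{\alpha}_m^T z) = 2 \hat{f}_{U^{(m-1)}}\bigl(2\Phi(\hat{\alpha}_m^T z) - 1\bigr)\,\phi(\hat{\alpha}_m^T z). \tag{10.70} \]

Thus, from (10.67) the estimate for $f_Z(z)$ is

\[ \hat{f}_Z(z) = \phi_p(z) \prod_{m=1}^{M} \left\{ \sum_{j=0}^{J} (2j + 1) P_j\bigl(2\Phi(\hat{\alpha}_m^T z) - 1\bigr) \bar{P}_{jm} \right\}, \tag{10.71} \]

where the

\[ \bar{P}_{jm} = \frac{1}{n} \sum_{i=1}^{n} P_j\bigl(2\Phi(\hat{\alpha}_m^T Z_i^{(m-1)}) - 1\bigr) \tag{10.72} \]

are estimated using the working variables accumulated during the structure removal
process, and $Z_i^{(0)} = Z_i$. Reversing the sphering by applying the change of variable
$X = P\Lambda^{1/2} Z + \bar{x}$ to $\hat{f}_Z$ provides the estimate $\hat{f}_X$.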
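The step from (10.67) to (10.71) rests on a cancellation of the normal density in each factor of the product. As a worked sketch, writing $\bar{P}_{jm}$ for the sample mean of $P_j$ over the working variables as in (10.72), and assuming the Legendre estimate (10.69) is substituted into (10.70) for each $m$:

```latex
\begin{align*}
\frac{\hat{f}_{\hat{\alpha}_m^T Z^{(m-1)}}(\hat{\alpha}_m^T z)}{\phi(\hat{\alpha}_m^T z)}
  &= 2\,\hat{f}_{U^{(m-1)}}\bigl(2\Phi(\hat{\alpha}_m^T z) - 1\bigr) \\
  &= 2 \sum_{j=0}^{J} \frac{2j+1}{2}\, P_j\bigl(2\Phi(\hat{\alpha}_m^T z) - 1\bigr)\, \bar{P}_{jm} \\
  &= \sum_{j=0}^{J} (2j+1)\, P_j\bigl(2\Phi(\hat{\alpha}_m^T z) - 1\bigr)\, \bar{P}_{jm}.
\end{align*}
```

Each of the $M$ factors therefore requires only the $J + 1$ stored sample moments for that iteration and one evaluation of $\Phi$ at the projected point.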

The estimate $\hat{f}_Z$ is most strongly influenced by the central portion of the data
because the transformation $U$ compresses information about the tails of $f_Z$ into the
extreme portions of the interval $[-1, 1]$. Low-degree Legendre polynomial expansion
has only limited capacity to capture substantial features of $f_U$ in these narrow margins
of the interval. Furthermore, the structure index driving the choice of each $\hat{\alpha}_m$ will not
assign high structure to directions for which only the tail behavior of the projection
is nonnormal. Therefore, exploratory projection pursuit should be viewed foremost
as a way to extract key low-dimensional features of the density which are exhibited
by the bulk of the data, and to reconstruct a density estimate that reflects these key
features.

FIGURE 10.9 First two projection and structure removal steps for Example 10.8, as described
in the text.


Example 10.8 (Bivariate Rotation) To illustrate exploratory projection pursuit,
we will attempt to reconstruct the density of some bivariate data. Let $W = (W_1, W_2)$,
where $W_1 \sim \mathrm{Gamma}(4, 2)$ and $W_2 \sim N(0, 1)$ independently. Then $E\{W\} = (2, 0)$
and $\mathrm{var}\{W\} = I$. Use
\[ R = \begin{pmatrix} -0.581 & -0.814 \\ -0.814 & 0.581 \end{pmatrix} \]


to rotate $W$ to produce data via $X = RW$. Let $f_X$ denote the density of $X$, which
we will try to estimate from a sample of $n = 500$ points drawn from $f_X$. Since
$\mathrm{var}\{X\} = RR^T = I$, the whitening transformation is nearly just a translation (aside
from the fact that the theoretical and sample variance–covariance matrices differ
slightly).

The whitened data values, $z_1, \ldots, z_{500}$, are plotted in the top left panel of
Figure 10.9. The underlying gamma structure is detectable in this graph as an abruptly
declining frequency of points in the top right region of the plot: $Z$ and $X$ are rotated
about 135 degrees counterclockwise with respect to $W$.
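Data of this kind can be simulated as sketched below. Two points are assumptions: Gamma(4, 2) is read as shape 4 and rate 2 (scale 1/2), which matches the stated mean 2 and variance 1, and the signs in `R` are chosen so that $RR^T = I$ as the text requires, since the printed matrix is the only source for them. The seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# W1 ~ Gamma(shape 4, rate 2) has mean 2 and variance 1; W2 ~ N(0, 1).
W = np.column_stack([rng.gamma(shape=4.0, scale=0.5, size=n),
                     rng.normal(size=n)])

# Orthogonal matrix applied to W; sign pattern assumed so that R R^T = I.
R = np.array([[-0.581, -0.814],
              [-0.814,  0.581]])

X = W @ R.T                      # each row is R w_i
sample_cov = np.cov(X, rowvar=False)
```

Because the theoretical covariance of $X$ is the identity, `sample_cov` differs from $I$ only through sampling variability.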

The direction $\hat{\alpha}_1$ that reveals the most univariate projected structure is shown
with the line in the top left panel of Figure 10.9. Clearly this direction corresponds
roughly to the original gamma-distributed coordinate. The bottom left panel of
Figure 10.9 shows a histogram of the $z_i$ values projected onto $\hat{\alpha}_1$, revealing a rather
nonnormal distribution. The curve superimposed on this histogram is the univariate
density estimate for $\hat{\alpha}_1^T Z$ obtained using the Legendre expansion strategy. Throughout
this example, the number of Legendre polynomials was set to $J + 1 = 4$.

FIGURE 10.10 Exploratory projection pursuit density estimate $\hat{f}_Z$ for Example 10.8.

Removing the structure revealed by the projection on $\hat{\alpha}_1$ yields new working
data values, $z_1^{(1)}, \ldots, z_{500}^{(1)}$, graphed in the top right panel of Figure 10.9. The projection
direction, $\hat{\alpha}_2$, showing the most nonnormal structure is again shown with a line. The
bottom right panel shows a histogram of $\hat{\alpha}_2^T z_i^{(1)}$ values and the corresponding Legendre
density estimate.

At this point, there is little need to proceed with additional projection and 
structure removal steps: The working data are already nearly multivariate normal. 
Employing (10.71) to reconstruct an estimate of fz yields the density estimate shown 
in Figure 10.10. The rotated gamma–normal structure is clearly seen in the figure,
with the heavier gamma tail extending leftward and the abrupt tail terminating on 
the right. The final step in application would be to reexpress this result in terms of a 
density for X rather than Z. 


PROBLEMS 


10.1. Sanders et al. provide a comprehensive dataset of infrared emissions and other charac- 
teristics of objects beyond our galaxy [567]. These data are available from the website 
for our book. Let X denote the log of the variable labeled F12, which is the total 
12-μm-band flux measurement on each object.


a. Fit a normal kernel density estimate for X, using bandwidths derived from 
the UCV(h) criterion, Silverman’s rule of thumb, the Sheather—Jones approach, 
Terrell’s maximal smoothing principle, and any other approaches you wish. 
Comment on the apparent suitability of each bandwidth for these data. 


b. Fit kernel density estimates for X using uniform, normal, Epanechnikov, and tri- 
weight kernels, each with bandwidth equivalent to the Sheather-Jones bandwidth 
for a normal kernel. Comment. 


c. Fit a nearest neighbor density estimate for X as in (10.48) with the uniform and
normal kernels. Next fit an Abramson adaptive estimate for X using a normal
kernel and setting h equal to the Sheather–Jones bandwidth for a fixed-bandwidth
estimator times the geometric mean of the $\tilde{f}_X(x_i)^{1/2}$ values.

d. If code for logspline density estimation is available, experiment with this approach
for estimating the density of X.

e. Let $\hat{f}_X$ denote the normal kernel density estimate for X computed using the
Sheather–Jones bandwidth. Note the ratio of this bandwidth to the bandwidth given
by Silverman’s rule of thumb. Transform the data back to the original scale (i.e.,
Z = exp{X}), and fit a normal kernel density estimate $\hat{f}_Z$, using a bandwidth equal
to Silverman’s rule of thumb scaled down by the ratio noted previously. (This is an
instance where the robust scale measure is far superior to the sample standard
deviation.) Next, transform $\hat{f}_X$ back to the original scale using the change-of-variable
formula for densities, and compare the two resulting estimates of density for Z
on the region between 0 and 8. Experiment further to investigate the relationship
between density estimation and nonlinear scale transformations. Comment.


10.2. This problem continues using the infrared data on extragalactic objects and the variable
X (the log of the 12-μm-band flux measurement) from Problem 10.1. The dataset also
includes F100 data: the total 100-μm-band flux measurements for each object. Denote
the log of this variable by Y. Construct bivariate density estimates for the joint density
of X and Y using the following approaches.

a. Use a standard bivariate normal kernel with bandwidth matrix $hI_2$. Describe how
you chose h.

b. Use a bivariate normal kernel with bandwidth matrix H chosen by Terrell’s maximal
smoothing principle. Find a constant c for which the bandwidth matrix cH provides
a superior density estimate.

c. Use a normal product kernel with the bandwidth for each coordinate chosen using
the Sheather–Jones approach.

d. Use a nearest neighbor estimator (10.48) with a normal kernel. Describe how you
chose k.

e. Use an Abramson adaptive estimator with the normal product kernel and
bandwidths chosen in the same manner as Example 10.7.

10.3. Starting from Equation (10.22), derive a simplification for UCV(h) when $K(z) =
\phi(z) = \exp\{-z^2/2\}/\sqrt{2\pi}$, pursuing the following steps:


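a. Show that

\[ \mathrm{UCV}(h) = \frac{1}{n^2h^2} \sum_{i=1}^{n} \int K^2\!\left(\frac{x - X_i}{h}\right) dx
 + \frac{1}{n^2h^2} \sum_{i=1}^{n} \sum_{j \neq i} \int K\!\left(\frac{x - X_i}{h}\right) K\!\left(\frac{x - X_j}{h}\right) dx
 - \frac{2}{n(n-1)h} \sum_{i=1}^{n} \sum_{j \neq i} K\!\left(\frac{X_i - X_j}{h}\right)
 = A + B + C, \]

where A, B, and C denote the three terms given above.

b. Show that

\[ A = \frac{1}{2nh\sqrt{\pi}}. \]

c. Show that

\[ B = \frac{1}{2n^2h\sqrt{\pi}} \sum_{i=1}^{n} \sum_{j \neq i} \exp\left\{ -\frac{(X_i - X_j)^2}{4h^2} \right\}. \tag{10.73} \]

d. Finish by showing (10.23).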


10.4. Replicate the first four rows of Table 10.2. Assume $\hat{f}$ is a product kernel estimator.
You may find it helpful to begin with the expression $\mathrm{MSE}_h(\hat{f}(x)) = \mathrm{var}\{\hat{f}(x)\} +
(\mathrm{bias}\{\hat{f}(x)\})^2$, and to use the result

\[ \phi(x; \mu, \sigma^2)\,\phi(x; \nu, \tau^2) = \phi\!\left(x;\ \frac{\mu\tau^2 + \nu\sigma^2}{\sigma^2 + \tau^2},\ \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right) \frac{\exp\{-(\mu - \nu)^2/[2(\sigma^2 + \tau^2)]\}}{\sqrt{2\pi(\sigma^2 + \tau^2)}}, \]

where $\phi(x; \alpha, \beta^2)$ denotes a univariate normal density function with mean $\alpha$ and
variance $\beta^2$.

10.5. Available from the website for this book are some manifold data exhibiting some strong
structure. Specifically, these four-dimensional data come from a mixture distribution,
with a low weighting of a density that lies nearly on a three-dimensional manifold and
a high weighting of a heavy-tailed density that fills four-dimensional space.


a. Estimate the direction of the least normal univariate projection of these data. Use a
sequence of graphs to guess a nonnormal projection direction, or follow the method
described for the projection step of exploratory projection pursuit.


b. Estimate the univariate density of the data projected in the direction found in 
part (a), using any means you wish. 


c. Use the ideas in this chapter to estimate and/or describe the density of these data 
via any productive means. Discuss the difficulties you encounter. 




DENSITY ESTIMATION 
AND SMOOTHING 


There are three concepts that connect the remaining chapters in this book. 
First, the methods are generally nonparametric. The lack of a formal sta- 
tistical model introduces computing tasks beyond straightforward parameter 
estimation. 

Second, the methods are generally intended for description rather than 
formal inference. We may wish to describe the probability distribution of a 
random variable or estimate the relationship between several random vari- 
ables. 

The most interesting questions in statistics ask how one thing depends 
on another. The paragon of all statistical strategies for addressing this question 
is the concept of regression (with all its forms, generalizations, and analogs), 
which describes how the conditional distribution of some variables depends 
on the value of other variables. 

The standard regression approach is parametric: One assumes an explicit, 
parameterized functional relationship between variables and then estimates 
the parameters using the data. This philosophy embraces the rigid assumption 
of a prespecified form for the regression function and in exchange enjoys the 
potential benefits of simplicity. Typically, all the data contribute to parameter 
estimation and hence to the global fit. The opposite trade-off is possible, 
however. We can reject the parametric assumptions in order to express the 
relationship more flexibly, but the estimated relationship can be more complex. 

Generally, we will call these approaches smoothing, and this brings us
to the third theme of the chapters ahead: The methods are usually based on the
concept of local averaging. Within small neighborhoods of predictor space
a summary statistic (e.g., the mean) of the values of the response variable(s)
within that neighborhood is used to describe the relationship. We will see that 
the local averaging concept is also implicit in our chapter on density estimation 
that begins this part of the book. Nonparametric density estimation is useful 
because for most real-world problems an appropriate parametric form for the 
density is either unknown or doesn’t exist. 

Thus, the primary focus of the remaining chapters is on nonparametric 
methods to describe and estimate densities or relationships using the philos- 
ophy of local averaging. Along the way, we detour into some related topics 
that extend these concepts in interesting alternative directions. 


CHAPTER 11


BIVARIATE SMOOTHING 


Consider the bivariate data shown in Figure 11.1. If asked, virtually anyone could draw 
a smooth curve that fits the data well, yet most would find it surprisingly difficult to 
describe precisely how they had done it. We focus here on a variety of methods for 
this task, called scatterplot smoothing. 

Effective smoothing methods for bivariate data are usually much simpler than
for higher-dimensional problems; therefore we initially limit consideration to the
case of $n$ bivariate data points $(x_i, y_i)$, $i = 1, \ldots, n$. Chapter 12 covers smoothing
multivariate data.

The goal of smoothing is different for predictor–response data than for general
bivariate data. With predictor–response data, the random response variable $Y$ is
assumed to be a function (probably stochastic) of the value of a predictor variable $X$. For
example, a model commonly assumed for predictor–response data is $Y_i = s(x_i) + \epsilon_i$,
where the $\epsilon_i$ are zero-mean stochastic noise and $s$ is a smooth function. In this case,
the conditional distribution of $Y|x$ describes how $Y$ depends on $X = x$. One sensible
smooth curve through the data would connect the conditional means of $Y|x$ for the
range of predictor values observed.

In contrast to predictor–response data, general bivariate data have the
characteristic that neither $X$ nor $Y$ is distinguished as the response. In this case, it is sensible
to summarize the joint distribution of $(X, Y)$. One smooth curve that would capture
a primary aspect of the relationship between $X$ and $Y$ would correspond to the ridge
top of their joint density; there are other reasonable choices, too. Estimating such
relationships can be considerably more challenging than smoothing predictor–response
data; see Sections 11.6 and 12.2.1.

Detailed discussion of smoothing techniques includes [101, 188, 308, 309, 314, 
322, 573, 599, 642, 651]. 


11.1 PREDICTOR-RESPONSE DATA 


Suppose that E{Y|x} = s(x) for a smooth function s. Because smoothing predictor— 
response data usually focuses on estimation of the conditional mean function s, 
smoothing is often called nonparametric regression. 

FIGURE 11.1 Predictor–response data. A smooth curve sketched through these data would
likely exhibit several peaks and troughs.

For a given point $x$, let $\hat{s}(x)$ be an estimator of $s(x)$. What estimator is best?
One natural approach is to assess the quality of $\hat{s}(x)$ as an estimator of $s(x)$ at $x$ using
the mean squared error (of estimation) at $x$, namely $\mathrm{MSE}(\hat{s}(x)) = E\{[\hat{s}(x) - s(x)]^2\}$,
where the expectation is taken with respect to the joint distribution of the responses.
By adding and subtracting $E\{\hat{s}(x)|x\}$ inside the squared term in this expression, it is
straightforward to obtain the familiar result that

\[ \mathrm{MSE}(\hat{s}(x)) = (\mathrm{bias}\{\hat{s}(x)\})^2 + \mathrm{var}\{\hat{s}(x)\}, \tag{11.1} \]

where $\mathrm{bias}\{\hat{s}(x)\} = E\{\hat{s}(x)\} - s(x)$.
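The add-and-subtract step behind (11.1) can be written out as follows; expectations are over the joint distribution of the responses with $x$ held fixed, and the cross term vanishes because $E\{\hat{s}(x) - E\{\hat{s}(x)\}\} = 0$.

```latex
\begin{align*}
E\{[\hat{s}(x) - s(x)]^2\}
  &= E\bigl\{[(\hat{s}(x) - E\{\hat{s}(x)\}) + (E\{\hat{s}(x)\} - s(x))]^2\bigr\} \\
  &= E\{[\hat{s}(x) - E\{\hat{s}(x)\}]^2\}
     + 2\,(E\{\hat{s}(x)\} - s(x))\,E\{\hat{s}(x) - E\{\hat{s}(x)\}\}
     + (E\{\hat{s}(x)\} - s(x))^2 \\
  &= \operatorname{var}\{\hat{s}(x)\} + (\operatorname{bias}\{\hat{s}(x)\})^2 .
\end{align*}
```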

Although we motivate smoothing by considering estimation of conditional 
means under squared error loss, alternative viewpoints are reasonable. For exam- 
ple, using absolute error loss shifts focus to the median{Y |x}. Thus, smoothing may 
be seen more generally as an attempt to describe how the center of the distribution of 
Y|x varies with x, for some notion of what constitutes the center. 

The smoother $\hat{s}(x)$ is usually based not only on the observed data $(x_i, y_i)$, for
$i = 1, \ldots, n$, but also on a user-specified smoothing parameter $\lambda$, whose value is
chosen to control the overall behavior of the smoother. Thus, we often write $\hat{s}_\lambda$ and
$\mathrm{MSE}_\lambda(\hat{s}_\lambda(x))$ hereafter.

Consider prediction of the response at a new point $x^*$, using the smoother $\hat{s}_\lambda$.
We introduced $\mathrm{MSE}_\lambda(\hat{s}_\lambda(x^*))$ to assess the quality of $\hat{s}_\lambda(x^*)$ as an estimator of the true
conditional mean, $s(x^*) = E\{Y|X = x^*\}$. Now, to assess the quality of the smoother
as a predictor of a single response at $X = x^*$, we use the mean squared prediction
error (MSPE) at $x^*$, namely

\[ \mathrm{MSPE}_\lambda(\hat{s}_\lambda(x^*)) = E\{(Y - \hat{s}_\lambda(x^*))^2 \mid X = x^*\} = \mathrm{var}\{Y \mid X = x^*\} + \mathrm{MSE}_\lambda(\hat{s}_\lambda(x^*)). \tag{11.2} \]
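The second equality in (11.2) follows from the same add-and-subtract device. The sketch below assumes that the new response $Y$ at $x^*$ is independent of the data used to construct $\hat{s}_\lambda$, so that the cross term factors and vanishes because $E\{Y - s(x^*) \mid X = x^*\} = 0$.

```latex
\begin{align*}
E\{(Y - \hat{s}_\lambda(x^*))^2 \mid X = x^*\}
  &= E\bigl\{[(Y - s(x^*)) + (s(x^*) - \hat{s}_\lambda(x^*))]^2 \mid X = x^*\bigr\} \\
  &= \operatorname{var}\{Y \mid X = x^*\}
     + 2\,E\{Y - s(x^*) \mid X = x^*\}\,E\{s(x^*) - \hat{s}_\lambda(x^*)\}
     + E\{[\hat{s}_\lambda(x^*) - s(x^*)]^2\} \\
  &= \operatorname{var}\{Y \mid X = x^*\} + \mathrm{MSE}_\lambda(\hat{s}_\lambda(x^*)).
\end{align*}
```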


More should be required of $\hat{s}_\lambda$ beyond good prediction at a single $x^*$. If $\hat{s}_\lambda$ is a
good smoother, it should limit $\mathrm{MSPE}_\lambda(\hat{s}_\lambda(x))$ over a range of $x$. For the observed
dataset, a good global measure of the quality of $\hat{\mathbf{s}}_\lambda = (\hat{s}_\lambda(x_1), \ldots, \hat{s}_\lambda(x_n))$ would
be $\mathrm{MSPE}_\lambda(\hat{\mathbf{s}}_\lambda) = (1/n) \sum_{i=1}^{n} \mathrm{MSPE}_\lambda(\hat{s}_\lambda(x_i))$, namely the average mean squared
prediction error. There are other good global measures of the quality of a smooth, but
in many cases the choice is asymptotically unimportant in the sense that they provide
equivalent asymptotic guidance about optimal smoothing [313].

Having discussed theoretical measures of performance of smoothers, we now
turn our focus to practical methods for constructing smoothers that perform well. For
predictor–response data, it’s difficult to resist the notion that a smoother should
summarize the conditional distribution of $Y_i$ given $X_i = x_i$ by some measure of location
like the conditional mean, even if the model $Y_i = s(x_i) + \epsilon_i$ is not assumed explicitly.
In fact, regardless of the type of data, nearly all smoothers rely on the concept of local
averaging. The $Y_i$ whose corresponding $x_i$ are near $x$ should be averaged in some way
to glean information about the appropriate value of the smooth at $x$.

A generic local-averaging smoother can be written as

\[ \hat{s}(x) = \mathrm{ave}\{Y_i \mid x_i \in \mathcal{N}(x)\} \tag{11.3} \]

for some generalized average function “ave” and some neighborhood of $x$, say $\mathcal{N}(x)$.
Different smoothers result from different choices for the averaging function (e.g.,
mean, weighted mean, median, or M-estimate) and the neighborhood (e.g., the nearest
few neighboring points, or all points within some distance). In general, the form of
$\mathcal{N}(x)$ may vary with $x$ so that different neighborhood sizes or shapes may be used in
different regions of the dataset.
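The generic form (11.3) separates two choices, the averaging function and the neighborhood, and a sketch can make that separation explicit. In the code below both are passed as functions; the two neighborhood constructors are illustrative examples, and all names are hypothetical.

```python
import numpy as np

def local_average(x0, x, y, neighborhood, ave=np.mean):
    """Generic local-averaging smoother evaluated at x0.

    neighborhood(x0, x) returns a boolean mask selecting the x_i in N(x0);
    ave is any summary function (mean, median, ...)."""
    mask = neighborhood(x0, x)
    return ave(y[mask])

def within_distance(b):
    """Neighborhood of all points within distance b of x0."""
    return lambda x0, x: np.abs(x - x0) <= b

def k_nearest(k):
    """Neighborhood of the k points nearest to x0."""
    def neighborhood(x0, x):
        mask = np.zeros(x.size, dtype=bool)
        mask[np.argsort(np.abs(x - x0))[:k]] = True
        return mask
    return neighborhood
```

For example, `local_average(5.0, x, y, within_distance(1.0), ave=np.median)` gives a local median over a fixed-width window, whereas `k_nearest(k)` lets the physical width of the neighborhood vary with the local density of the predictor values.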

The most important characteristic of a neighborhood is its span, which is
represented by the smoothing parameter $\lambda$. In a general sense, the span of a neighborhood
measures its inclusiveness: Neighborhoods with small span are strongly local,
including only very nearby points, whereas neighborhoods with large span have wider
membership. There are many ways to measure a neighborhood’s inclusiveness, including
its size (number of points), span (proportion of sample points that are members),
bandwidth (physical length or volume of the neighborhood), and other concepts
discussed later. We use $\lambda$ to denote whichever concept is most natural for each smoother.

The smoothing parameter controls the wiggliness of $\hat{s}_\lambda$. Smoothers with small
spans tend to reproduce local patterns very well but draw little information from more
distant data. A smoother that ignores distant data containing useful information about
the local response will have higher variability than could otherwise be achieved. In
contrast, smoothers with large spans draw lots of information from distant data when
making local predictions. When these data are of questionable relevance, potential
bias is introduced. Adjusting $\lambda$ controls this trade-off between bias and variance.

Below we introduce some strategies for constructing local-averaging smoothers. 
This chapter focuses on smoothing methods for predictor-response data, but 
Section 11.6 briefly addresses issues regarding smoothing general bivariate data, 
which are further considered in Chapter 12. 


11.2 LINEAR SMOOTHERS 


An important class of smoothers is the linear smoothers. For such smoothers, the
prediction at any point $x$ is a linear combination of the response values. Linear smoothers
are faster to compute and easier to analyze than nonlinear smoothers.

Frequently, it suffices to consider estimation of the smooth at only the observed
$x_i$ points. For a vector of predictor values, $\mathbf{x} = (x_1 \ \ldots \ x_n)^T$, denote the vector of
corresponding response variables as $\mathbf{Y} = (Y_1 \ \ldots \ Y_n)^T$, and define
$\hat{\mathbf{s}} = (\hat{s}(x_1) \ \ldots \ \hat{s}(x_n))^T$.
Then a linear smoother can be expressed as $\hat{\mathbf{s}} = \mathbf{S}\mathbf{Y}$ for an $n \times n$ smoothing matrix
$\mathbf{S}$ whose entries do not depend on $\mathbf{Y}$. A variety of linear smoothers are introduced
below.


11.2.1 Constant-Span Running Mean 


A very simple smoother takes the sample mean of k nearby points: 


$$\hat{s}_k(x_i) = \sum_{\{j :\, x_j \in \mathcal{N}(x_i)\}} \frac{Y_j}{k}. \qquad (11.4)$$


Insisting on odd $k$, we define $\mathcal{N}(x_i)$ as $x_i$ itself, the $(k-1)/2$ points whose predictor
values are nearest below $x_i$, and the $(k-1)/2$ points whose predictor values are
nearest above $x_i$. This $\mathcal{N}(x_i)$ is termed the symmetric nearest neighborhood, and the
smoother is sometimes called a moving average.

Without loss of generality, assume hereafter that the data pairs have been sorted 
so that the x; are in increasing order. Then the constant-span running-mean smoother 
can be written as 


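$$\hat{s}_k(x_i) = \operatorname{mean}\left\{Y_j \ \text{for} \ \max\left(i - \frac{k-1}{2},\, 1\right) \le j \le \min\left(i + \frac{k-1}{2},\, n\right)\right\}. \qquad (11.5)$$

For the purposes of graphing or prediction, one can compute $\hat{s}_k$ at each of the $x_i$ and
interpolate linearly in between. Note that by stepping through $i$ in order, we can
efficiently compute $\hat{s}_k$ at $x_{i+1}$ with the recursive update

$$\hat{s}_k(x_{i+1}) = \hat{s}_k(x_i) - \frac{Y_{i-(k-1)/2}}{k} + \frac{Y_{i+(k+1)/2}}{k}. \qquad (11.6)$$
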
This avoids recalculating the mean at each point. An analogous update holds for points 
whose predictor values lie near the edges of the data. 
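
The recursive update can be sketched in Python as follows. This is a minimal illustration rather than code from the book: the function name `running_mean` is ours, the responses are assumed to be already sorted by predictor value, and edge neighborhoods are truncated as in (11.5). A running total is adjusted as points enter and leave the window, so the mean is never recomputed from scratch.

```python
import numpy as np

def running_mean(y, k):
    """Constant-span running mean for responses sorted by predictor.

    Neighborhoods are truncated at the edges, as in (11.5).
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    y = np.asarray(y, dtype=float)
    n = len(y)
    half = (k - 1) // 2
    smooth = np.empty(n)
    lo, hi = 0, min(half, n - 1)          # current window is y[lo..hi], inclusive
    total = y[lo:hi + 1].sum()
    smooth[0] = total / (hi - lo + 1)
    for i in range(1, n):
        new_lo = max(i - half, 0)
        new_hi = min(i + half, n - 1)
        if new_hi > hi:                   # one point enters on the right
            total += y[new_hi]
        if new_lo > lo:                   # one point leaves on the left
            total -= y[lo]
        lo, hi = new_lo, new_hi
        smooth[i] = total / (hi - lo + 1)
    return smooth
```

For interior points the two adjustments together are exactly the update in (11.6); near the edges only one of them fires and the divisor shrinks accordingly.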


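The constant-span running-mean smoother is a linear smoother. The middle
rows of the smoothing matrix $\mathbf{S}$ resemble
$\left(0 \ \cdots \ 0 \ \tfrac{1}{k} \ \cdots \ \tfrac{1}{k} \ 0 \ \cdots \ 0\right)$. An important
detail in most smoothing problems is how to compute $\hat{s}_k(x_i)$ near the edges of the
data. For example, $x_1$ does not have $(k-1)/2$ neighbors to its left. Some adjustment
must be made to the top and bottom $(k-1)/2$ rows of $\mathbf{S}$. Three possible choices (e.g.,
for $k = 5$) are to shrink symmetric neighborhoods by using

$$\mathbf{S} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
\tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 & 0 & 0 & \cdots & 0 \\
\tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & 0 & \cdots & 0 \\
0 & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \cdots & 0 \\
\vdots & & & & & & & \vdots
\end{pmatrix}, \qquad (11.7)$$

to truncate neighborhoods by using

$$\mathbf{S} = \begin{pmatrix}
\tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 & 0 & 0 & \cdots & 0 \\
\tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} & 0 & 0 & \cdots & 0 \\
\tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & 0 & \cdots & 0 \\
0 & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \cdots & 0 \\
\vdots & & & & & & & \vdots
\end{pmatrix}, \qquad (11.8)$$

or, in the case of circular data only, to wrap neighborhoods by using

$$\mathbf{S} = \begin{pmatrix}
\tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & 0 & 0 & \cdots & 0 & \tfrac{1}{5} & \tfrac{1}{5} \\
\tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & 0 & \cdots & 0 & 0 & \tfrac{1}{5} \\
\tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \cdots & 0 & 0 & 0 \\
0 & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \tfrac{1}{5} & \cdots & 0 & 0 & 0 \\
\vdots & & & & & & & & \vdots
\end{pmatrix}. \qquad (11.9)$$

FIGURE 11.2 Results from a constant-span running-mean smoother with $k = 13$ (solid line),
compared with the true underlying curve (dotted line).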

The truncation option is usually preferred, and is implicit in (11.5). Since k is intended 
to be a rather small fraction of n, the overall picture presented by the smooth is not 
greatly affected by the treatment of the edges, but regardless of how this detail is 
addressed, readers should be aware of the reduced reliability of $\hat{s}$ at the edges of the
data. 
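
The three edge treatments can be compared directly by building the smoothing matrix. The sketch below is illustrative only; the function name `running_mean_matrix` and the string labels for the `edges` argument are our own choices, and the data are assumed to be sorted by predictor.

```python
import numpy as np

def running_mean_matrix(n, k, edges="truncate"):
    """Smoothing matrix S of the span-k running mean for n sorted points."""
    if k % 2 == 0 or not 1 <= k <= n:
        raise ValueError("k must be odd with 1 <= k <= n")
    half = (k - 1) // 2
    S = np.zeros((n, n))
    for i in range(n):
        if edges == "wrap":            # circular data only, cf. (11.9)
            cols = np.arange(i - half, i + half + 1) % n
        elif edges == "shrink":        # symmetric but narrower, cf. (11.7)
            r = min(half, i, n - 1 - i)
            cols = np.arange(i - r, i + r + 1)
        elif edges == "truncate":      # drop the missing neighbors, cf. (11.8)
            cols = np.arange(max(i - half, 0), min(i + half, n - 1) + 1)
        else:
            raise ValueError("unknown edge treatment")
        S[i, cols] = 1.0 / len(cols)   # equal weights that sum to 1 in each row
    return S
```

With `S = running_mean_matrix(len(y), k)`, the smooth at the observed points is `S @ y`. Forming the full matrix costs order $n^2$ storage, so it is mainly useful for inspecting rows or computing quantities such as the diagonal of $\mathbf{S}$.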


Example 11.1 (Easy Data) Figure 11.2 shows a constant-span running-mean 
smooth of the data introduced at the start of this chapter, which are easy to smooth well
using a variety of methods we will discuss. These data are $n = 200$ equally spaced
points from the model $Y_i = s(x_i) + \epsilon_i$, where the errors are mean-zero i.i.d. normal
noise with a standard deviation of 1.5. The data are available from the website for this
book. The true relationship, $s(x) = x^3 \sin\{(x + 3.4)/2\}$, is shown with a dotted line;
the estimate $\hat{s}_k(x)$ is shown with the solid line. We used a smoothing matrix equivalent
to (11.8) for k = 13. The result is not visually appealing: Perhaps this emphasizes the 
surprising sophistication of whatever methods people employ when they sketch in a 
smooth curve by hand. 


11.2.1.1 Effect of Span A natural smoothing parameter for the constant-span
running-mean smoother is $\lambda = k$. As for all smoothers, this parameter controls
wiggliness, here by directly controlling the number of data points contained in any
neighborhood. For sorted data and an interior point $x_i$ whose neighborhood is not affected
by the data edges, the span-$k$ running-mean smoother given by (11.5) has


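$$\mathrm{MSE}(\hat{s}_k(x_i)) = E\left\{\left(s(x_i) - \frac{1}{k} \sum_{j=i-(k-1)/2}^{i+(k-1)/2} Y_j\right)^2\right\}, \qquad (11.10)$$

where, recall, $s(x_i) = E\{Y \mid X = x_i\}$. It is straightforward to reexpress this as

$$\mathrm{MSE}(\hat{s}_k(x_i)) = \left(\mathrm{bias}\{\hat{s}_k(x_i)\}\right)^2 + \frac{1}{k^2} \sum_{j=i-(k-1)/2}^{i+(k-1)/2} \mathrm{var}\{Y \mid X = x_j\}, \qquad (11.11)$$

where

$$\mathrm{bias}\{\hat{s}_k(x_i)\} = s(x_i) - \frac{1}{k} \sum_{j=i-(k-1)/2}^{i+(k-1)/2} s(x_j). \qquad (11.12)$$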


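To understand how the mean squared prediction error depends on the smoothing span,
we can use (11.11) and make the simplifying assumption that $\mathrm{var}\{Y \mid X = x_j\} = \sigma^2$
for all $x_j \in \mathcal{N}(x_i)$. Then

$$\begin{aligned}
\mathrm{MSPE}_k(\hat{s}_k(x_i)) &= \mathrm{var}\{Y \mid X = x_i\} + \mathrm{MSE}(\hat{s}_k(x_i)) \\
&= \left(1 + \frac{1}{k}\right) \sigma^2 + \left(\mathrm{bias}\{\hat{s}_k(x_i)\}\right)^2.
\end{aligned} \qquad (11.13)$$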


Therefore, as the neighborhood size $k$ is increased, the variance term in (11.13)
decreases, but the bias term will typically increase because $s(x_i)$ will not likely be similar
to $s(x_j)$ for distant $j$. Likewise, if $k$ is decreased, the variance term will increase, but
the bias term will usually be smaller.
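
To make the trade-off concrete, the following sketch evaluates the two terms of (11.13) when the true curve is known. It is only an illustration: the helper name `mspe_interior` is ours, the variance is assumed constant, and the curve $s(x) = x^2$ used in the usage note is a hypothetical choice, not the book's data.

```python
import numpy as np

def mspe_interior(s_vals, i, k, sigma2):
    """Evaluate (11.13) at interior index i for a span-k running mean.

    s_vals holds the true curve at the sorted design points; sigma2 is the
    (assumed constant) conditional variance. Returns (MSPE, bias).
    """
    half = (k - 1) // 2
    if i - half < 0 or i + half >= len(s_vals):
        raise ValueError("x_i must be an interior point for this k")
    bias = s_vals[i] - s_vals[i - half:i + half + 1].mean()    # (11.12)
    return (1.0 + 1.0 / k) * sigma2 + bias**2, bias

# Hypothetical curve s(x) = x^2 on an equally spaced grid with spacing 0.1.
x = 0.1 * np.arange(101)
for k in (3, 11, 41):
    mspe, bias = mspe_interior(x**2, 50, k, sigma2=1.0)
    print(k, round(mspe, 4), round(bias, 4))
```

For this quadratic curve on a grid with spacing $d$, the bias at an interior point works out to $-d^2 h(h+1)/3$ with $h = (k-1)/2$, so the squared bias grows with $k$ while the variance term $\sigma^2/k$ shrinks.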


Example 11.2 (Easy Data, Continued) Figure 11.3 illustrates how $k$ influences
$\hat{s}_k$. In this graph, $k = 3$ leads to a result that is far too wiggly. In contrast, $k = 43$
leads to a result that is quite smooth but systematically biased. The bias arises when
a neighborhood is so wide that the response values at the fringes of the neighborhood
are not representative of the response at the middle. This tends to erode peaks, fill in
troughs, and flatten trends near the edges of the range of the predictor.

FIGURE 11.3 Results from a constant-span running-mean smoother with $k = 3$ (wigglier
solid line) and $k = 43$ (smoother solid line). The underlying true curve is shown with a
dotted line.


11.2.1.2 Span Selection for Linear Smoothers The best choice for k clearly 
must balance a trade-off between bias and variance. For small k, the estimated 
curve will be wiggly but exhibit more fidelity to the data. For large k, the estimated 
curve will be smooth but exhibit substantial bias in some regions. For all smoothers, 
the role of the smoothing parameter is to control this trade-off between bias and 
variance. 

An expression for $\mathrm{MSPE}_k(\hat{s}_k)$ can be obtained by averaging values from (11.13)
over all $x_i$, but this expression cannot be minimized to choose $k$ because it depends
on unknown expected values. Furthermore, it may be more reasonable to choose the
span that is best for the observed data, rather than the span that is best on average
for datasets that might have been observed but weren’t. Therefore, we might consider
choosing the $k$ that minimizes the residual mean squared error, that is, the residual
sum of squares (RSS) divided by $n$:

$$\frac{\mathrm{RSS}_k(\hat{s}_k)}{n} = \frac{1}{n} \sum_{i=1}^n \left(Y_i - \hat{s}_k(x_i)\right)^2. \qquad (11.14)$$

However,

$$E\left\{\frac{\mathrm{RSS}_k(\hat{s}_k)}{n}\right\} = \mathrm{MSPE}_k(\hat{s}_k) - \frac{2}{n} \sum_{i=1}^n \mathrm{cov}\{Y_i, \hat{s}_k(x_i)\}. \qquad (11.15)$$

For constant-span running means, $\mathrm{cov}\{Y_i, \hat{s}_k(x_i)\} = \mathrm{var}\{Y \mid X = x_i\}/k$ for interior $x_i$.
Therefore, $\mathrm{RSS}_k(\hat{s}_k)/n$ is a downward-biased estimator of $\mathrm{MSPE}_k(\hat{s}_k)$.

FIGURE 11.4 Plot of $\mathrm{CVRSS}_k(\hat{s}_k)$ versus $k$ for the constant-span running-mean smoother
applied to the data in Figure 11.1. Good choices for $k$ range between about 11 and 23. The
smaller values in this range would be especially good at bias reduction, whereas the larger ones
would produce smoother fits.


To eliminate the correlation between $Y_i$ and $\hat{s}_k(x_i)$, we may omit the $i$th point
when calculating the smooth at $x_i$. This process is known as cross-validation [616];
it is used only for assessing the performance of the smooth, not for fitting the
smooth itself. Denote by $\hat{s}_k^{(-i)}(x_i)$ the value of the smooth at $x_i$ when it is fitted using
the dataset that omits the $i$th data pair. A better (indeed, pessimistic) estimator of
$\mathrm{MSPE}_k(\hat{s}_k)$ is

$$\frac{\mathrm{CVRSS}_k(\hat{s}_k)}{n} = \frac{1}{n} \sum_{i=1}^n \left(Y_i - \hat{s}_k^{(-i)}(x_i)\right)^2, \qquad (11.16)$$

where $\mathrm{CVRSS}_k(\hat{s}_k)$ is called the cross-validated residual sum of squares. Typically,
$\mathrm{CVRSS}_k(\hat{s}_k)$ is plotted against $k$.


Example 11.3 (Easy Data, Continued) Figure 11.4 shows a plot of $\mathrm{CVRSS}_k(\hat{s}_k)$
against $k$ for smoothing the data introduced in Example 11.1. This plot usually shows
a steep increase in $\mathrm{CVRSS}_k(\hat{s}_k)$ for small $k$ due to increasing variance, and a
gradual increase in $\mathrm{CVRSS}_k(\hat{s}_k)$ for large $k$ due to increasing bias. The region of best
performance is where the curve is lowest; this region is often quite broad and rather
flat. In this example, good choices of $k$ range between 11 and 23, with $k = 13$ being
optimal. Minimizing $\mathrm{CVRSS}_k(\hat{s}_k)$ with respect to $k$ often produces a final smooth
that is somewhat too wiggly. Undersmoothing can be reduced by choosing a larger $k$
within the low $\mathrm{CVRSS}_k(\hat{s}_k)$ valley in the cross-validation plot, which corresponds to
good performance. In this example, $k = 23$ would be worth trying.

This approach of leave-one-out cross-validation is time consuming, even for
linear smoothers, since it seems to require computing n separate smooths of slightly 
different datasets. Two shortcuts are worth mentioning. 

First, consider a linear smoother with smoothing matrix $\mathbf{S}$. The proper fit at $x_i$
when the $i$th data pair is omitted from the dataset is a somewhat imprecise concept,
even for a constant-span running-mean smoother, because smooths are typically
calculated only at the $x_i$ values in the dataset. Should smooths be fitted at the two data
points adjacent to the omitted $x_i$, with linear interpolation used in between, or should
some other approach be tried? The most unambiguous way to proceed is to define

$$\hat{s}_k^{(-i)}(x_i) = \sum_{\substack{j=1 \\ j \ne i}}^n \frac{Y_j S_{ij}}{\sum_{m \ne i} S_{im}}, \qquad (11.17)$$


where $S_{ij}$ is the $(i, j)$th element of $\mathbf{S}$. In other words, the $i$th row of $\mathbf{S}$ is altered by
replacing the $(i, i)$th element of $\mathbf{S}$ with zero and rescaling the remainder of the row so
it sums to 1. In this case, to compute $\mathrm{CVRSS}_k(\hat{s}_k)$ there is no need to actually delete
the $i$th observation and recompute the smooth for each $i$. Following from (11.17), it
can be shown that for linear smoothers, (11.16) can be reexpressed as

$$\frac{\mathrm{CVRSS}_k(\hat{s}_k)}{n} = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat{s}_k(x_i)}{1 - S_{ii}}\right)^2. \qquad (11.18)$$

This approach is analogous to the well-known shortcut for calculating deleted resid- 
uals in linear regression [483] and is further justified in [322]. 
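
The equivalence of the explicit definition and the shortcut is easy to check numerically. The sketch below is our own illustration (the function names are hypothetical) and assumes, as holds for the running-mean matrices above, that every row of the smoothing matrix sums to 1 and that each diagonal entry is less than 1.

```python
import numpy as np

def cvrss_explicit(S, y):
    """CVRSS from (11.16), with the deleted fits defined by (11.17)."""
    total = 0.0
    for i in range(len(y)):
        w = S[i].copy()
        w[i] = 0.0                  # zero out the ith weight ...
        w /= w.sum()                # ... and rescale the row to sum to 1
        total += (y[i] - w @ y) ** 2
    return total

def cvrss_shortcut(S, y):
    """CVRSS from (11.18); assumes rows of S sum to 1 and S_ii < 1."""
    resid = y - S @ y
    return float(np.sum((resid / (1.0 - np.diag(S))) ** 2))
```

The shortcut needs only one smooth of the full dataset plus the diagonal of $\mathbf{S}$, whereas the explicit version loops over all $n$ deleted rows.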

Second, one may wish to reduce the number of cross-validation computations 
by generating fewer partial datasets, each with a greater number of points omitted. 
For example, one could randomly partition the observed dataset into 10 portions, then 
leave out one portion at a time. The cross-validated residual sum of squares would 
then be accumulated from the residuals of the points omitted in each portion. This 
approach tends to overestimate the true prediction error, while leaving-one-out is less 
biased but more variable; 5- or 10-fold cross-validation (i.e., 5-10 portions) has been 
recommended [323]. 

We mentioned above that different smoothers employ different smoothing pa- 
rameters to control wiggliness. So far, we have focused on the number (k) or fraction 
($k/n$) of nearest neighbors. Another reasonable choice, $\mathcal{N}(x) = \{x_i : |x_i - x| < h\}$,
uses the positive real-valued distance h as a smoothing parameter. There are also 
schemes for weighting points based on their proximity to x, in which case the smooth- 
ing parameter may relate to these weights. Usually, the number of points in a neigh- 
borhood is smaller near the boundaries of the data, meaning that any fixed span chosen 
by cross-validation or another method may provide a poorer fit near the boundaries 
than in the middle of the data. The span may also be allowed to vary locally. For such 
alternative parameterizations of neighborhoods, plotting the cross-validated residual 
sum of squares and drawing conclusions about the bias-variance trade-off proceed in
a fashion analogous to the preceding discussion.

Cross-validated span selection is not limited to the constant-span running-mean
smoother. The same strategy is effective for most other smoothers discussed in this 
chapter. The trade-off between bias and variance is a fundamental principle in many 
areas of statistics: It arose previously for density estimation (Chapter 10), and it is 
certainly a major consideration for all types of smoothing. 

There are a wide variety of other methods for choosing the span for a scatterplot 
smoother, resulting in different bias-variance trade-offs [309, 310, 314, 322, 323].
One straightforward approach is to replace CVRSS with another criterion such as Cp, 
AIC, or BIC [323]. Two other popular alternatives are generalized cross-validation 
(GCVRSS) and plug-in methods [311, 564, 599]. In generalized cross-validation, 
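(11.16) is replaced by

$$\mathrm{GCVRSS}_k(\hat{s}_k) = \frac{\mathrm{RSS}_k(\hat{s}_k)}{\left(1 - \mathrm{tr}\{\mathbf{S}\}/n\right)^2}, \qquad (11.19)$$

where $\mathrm{tr}\{\mathbf{S}\}$ denotes the sum of the diagonal elements of $\mathbf{S}$. For equally spaced $x_i$,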
CVRSS and GCVRSS give similar results. When the data are not equally spaced, span 
selection based on GCVRSS is less affected by observations that exert strong influ- 
ence on the fit. Notwithstanding this potential benefit of generalized cross-validation, 
reliance on GCVRSS often results in significant undersmoothing. Plug-in methods 
generally derive an expression for the expected mean squared prediction error or some 
other fitting criterion, whose theoretical minimum is found to depend on the type of 
smoother, the wiggliness of the true curve, and the conditional variance of $Y \mid x$. A
preliminary smooth is completed using a span chosen informally (or by cross-validation).
Then this smooth is used to estimate the unknown quantities in the expression for the 
optimal span, and the result is used in a final smooth. 

It is tempting to select the span selection method that yields the picture most 
pleasing to your eye. That is fine, but it is worthwhile admitting up front that scatterplot
smoothing is often an exercise in descriptive, not inferential, statistics, so selecting
your favorite span from trial and error or a simple plot of CVRSS is as reasonable 
as the opportunistic favoring of any technical method. Since spans chosen by cross- 
validation vary with the random dataset observed and sometimes undersmooth, it is 
important for practitioners to develop their own expertise based on hands-on analysis 
and experience. 


11.2.2 Running Lines and Running Polynomials 


The constant-span running-mean smoother exhibits visually unappealing wiggliness 
for any reasonable k. It also can have strong bias at the edges because it fails to 
recognize the local trend in the data. The running-line smoother can mitigate both 
problems. 

Consider fitting a linear regression model to the $k$ data points in $\mathcal{N}(x_i)$. Then
the least squares linear regression prediction at $x$ is

$$\ell_i(x) = \bar{Y}_i + \hat{\beta}_i (x - \bar{x}_i), \qquad (11.20)$$

where $\bar{Y}_i$, $\bar{x}_i$, and $\hat{\beta}_i$ are the mean response, the mean predictor, and the estimated slope
of the regression line, respectively, for the data in $\mathcal{N}(x_i)$. The running-line smooth at
$x_i$ is $\hat{s}_k(x_i) = \ell_i(x_i)$.

FIGURE 11.5 Plot of the running-line smooth for $k = 23$ (solid line) and the true underlying
curve (dotted line).

Let $\mathbf{X}_i = (\mathbf{1} \ \mathbf{x}_i)$, where $\mathbf{1}$ is a column of ones and $\mathbf{x}_i$ is the column vector of
predictor data in $\mathcal{N}(x_i)$; and let $\mathbf{Y}_i$ be the corresponding column vector of response
data. Then note that $\ell_i(x_i)$, and hence the smooth at $x_i$, is obtained by multiplying
$\mathbf{Y}_i$ by one row of $\mathbf{H}_i = \mathbf{X}_i (\mathbf{X}_i^T \mathbf{X}_i)^{-1} \mathbf{X}_i^T$. (Usually $\mathbf{H}_i$ is called the $i$th hat matrix.)
Therefore, this smoother is linear, with a banded smoothing matrix $\mathbf{S}$ whose nonzero
entries are drawn from an appropriate row of each $\mathbf{H}_i$. Computing the smooth directly
from $\mathbf{S}$ is not very efficient. For data ordered by $x_i$, it is faster to sequentially update the
sufficient statistics for regression, analogously to the approach discussed for running
means.
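
A direct, unoptimized version of the running-line smoother can be sketched as follows. This is an illustration under our own assumptions: the function name `running_line` is hypothetical, the predictor values are sorted and distinct, neighborhoods are truncated at the edges, and each local fit is recomputed from scratch rather than by the faster sequential update of sufficient statistics.

```python
import numpy as np

def running_line(x, y, k):
    """Running-line smooth at each x_i with truncated symmetric neighborhoods.

    Assumes x is sorted with distinct values and k is odd with k >= 3.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, half = len(x), (k - 1) // 2
    smooth = np.empty(n)
    for i in range(n):
        lo, hi = max(i - half, 0), min(i + half, n - 1)
        xn, yn = x[lo:hi + 1], y[lo:hi + 1]
        xbar, ybar = xn.mean(), yn.mean()
        slope = np.sum((xn - xbar) * (yn - ybar)) / np.sum((xn - xbar) ** 2)
        smooth[i] = ybar + slope * (x[i] - xbar)    # (11.20) evaluated at x_i
    return smooth
```

Because each local fit is a least squares line, data that lie exactly on a line are reproduced exactly, including at the edges. This is the property that reduces edge bias relative to the running mean.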


Example 11.4 (Easy Data, Continued) Figure 11.5 shows a running-line smooth 
of the data introduced in Example 11.1, for the span $k = 23$ chosen by cross-validation.
The edge effects are much smaller and the smooth is less jagged than with the constant- 
span running-mean smoother. Since the true curve is usually well approximated by a 
line even for fairly wide neighborhoods, k may be increased from the optimal value for 
the constant-span running-mean smoother. This reduces variance without seriously 
increasing bias. 


Nothing in this discussion limits the local fitting to simple linear regression.
A running polynomial smoother could be produced by setting $\hat{s}_k(x_i)$ to the value at
$x_i$ of a least squares polynomial regression fit to the data in $\mathcal{N}(x_i)$. Such smoothers
are sometimes called local regression smoothers (see Section 11.2.4). Odd-order
polynomials are preferred [192, 599]. Since smooth functions are roughly locally
linear, higher-order local polynomial regression often offers scant advantage over the
simpler linear fits unless the true curve has very sharp wiggles.


11.2.3 Kernel Smoothers 


For the smoothers mentioned so far, there is a discontinuous change to the fit each time 
the neighborhood membership changes. Therefore, they tend to fit well statistically 
but exhibit visually unappealing jitters or wiggles. 

One approach to increasing smoothness is to redefine the neighborhood so that 
points only gradually gain or lose membership in it. Let K be a symmetric kernel 
centered at 0. A kernel is essentially a weighting function—in this case it weights 
neighborhood membership. One reasonable kernel choice would be the standard
normal density, $K(z) = (1/\sqrt{2\pi}) \exp\{-z^2/2\}$. Then let

$$\hat{s}_h(x) = \sum_{i=1}^n Y_i \frac{K\left((x - x_i)/h\right)}{\sum_{j=1}^n K\left((x - x_j)/h\right)}, \qquad (11.21)$$

where the smoothing parameter $h$ is called the bandwidth. Notice that for many
common kernels such as the normal kernel, all data points are used to calculate the 
smooth at each point, but very distant data points receive very little weight. Proximity 
increases a point’s influence on the local fit; in this sense the concept of local averaging 
remains. A large bandwidth yields a quite smooth result because the weightings of the 
data points change little across the range of the smooth. A small bandwidth ensures 
a much greater dominance of nearby points, thus producing more wiggles. 
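
A kernel smoother with a normal kernel takes only a few lines. The sketch below is illustrative; the function name `kernel_smooth` and its optional `x_eval` argument are our own. It returns both the fit and the weight matrix, which equals the smoothing matrix $\mathbf{S}$ when the evaluation points are the observed $x_i$.

```python
import numpy as np

def kernel_smooth(x, y, h, x_eval=None):
    """Kernel smooth (11.21) with a normal kernel and bandwidth h.

    Returns the fitted values at x_eval (default: the observed x) and the
    matrix of normalized weights, one row per evaluation point.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_eval = x if x_eval is None else np.asarray(x_eval, dtype=float)
    z = (x_eval[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * z**2)              # the constant 1/sqrt(2*pi) cancels
    W /= W.sum(axis=1, keepdims=True)    # each row sums to 1
    return W @ y, W
```

Every evaluation point requires a fresh row of weights, which is why the sequential-update trick for running means does not carry over. As $h$ grows the fit approaches the overall mean of the responses, and as $h$ shrinks toward zero the fit at each observed point approaches the observed response.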

The choice of smoothing kernel is much less important than the choice of 
bandwidth. A similar smooth will be produced from diverse kernel shapes. Although 
kernels need not be densities, a smooth, symmetric, nonnegative function with tails 
tending continuously toward zero is generally best in practice. Thus, there are few 
reasons to look beyond a normal kernel, despite a variety of asymptotic arguments 
supporting more exotic choices. 

Kernel smoothers are clearly linear smoothers. However, the computation of 
the smooth cannot be sequentially updated in the manner of the previous efficient 
approaches because the weights for all points change each time x changes. Fast Fourier 
transform methods are helpful in the special case of equally spaced data [307, 596]. 
Further background on kernel smoothing is given by [573, 581, 599, 651]. 


Example 11.5 (Easy Data, Continued) Figure 11.6 shows a kernel smooth of the 
data introduced in Example 11.1, using a normal kernel with bandwidth $\lambda = 0.16$ chosen by
cross-validation. Since neighborhood entries and exits are gradual, the result exhibits 
characteristically rounded features. However, note that the kernel smoother does not 
eliminate systematic bias at the edges, as the running-line smoother does. 


11.2.4 Local Regression Smoothing 


FIGURE 11.6 Plot of kernel smooth using a normal kernel with $\lambda = 0.16$ chosen by
cross-validation (solid line) and the true underlying curve (dotted line).

Running polynomial smoothers and kernel smoothers share some important links
[10, 308, 599]. Suppose that the data originated from a random design, so they are a
random sample from the model $(X_i, Y_i) \sim$ i.i.d. $f(x, y)$. (A nonrandom design would
have prespecified the $x_i$ values.) We may write

$$s(x) = E\{Y \mid x\} = \int y f(y \mid x)\, dy = \int y \frac{f(x, y)}{f(x)}\, dy, \qquad (11.22)$$

where, marginally, $X \sim f(x)$. Using the kernel density estimation approach described
in Chapter 10 [and a product kernel for estimating $f(x, y)$], we may estimate

$$\hat{f}(x, y) = \frac{1}{n h_x h_y} \sum_{i=1}^n K_x\left(\frac{x - X_i}{h_x}\right) K_y\left(\frac{y - Y_i}{h_y}\right) \qquad (11.23)$$

and

$$\hat{f}(x) = \frac{1}{n h_x} \sum_{i=1}^n K_x\left(\frac{x - X_i}{h_x}\right) \qquad (11.24)$$

for suitable kernels $K_x$ and $K_y$ and corresponding bandwidths $h_x$ and $h_y$. The
Nadaraya–Watson estimator [476, 655] of $s(x)$ is obtained by substituting $\hat{f}(x, y)$
and $\hat{f}(x)$ in (11.22), yielding

$$\hat{s}_{h_x}(x) = \sum_{i=1}^n Y_i \frac{K_x\left((x - X_i)/h_x\right)}{\sum_{j=1}^n K_x\left((x - X_j)/h_x\right)}. \qquad (11.25)$$
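
The substitution step can be spelled out. The sketch below assumes, beyond what the text states explicitly, that $K_y$ is a density with mean zero (for example, any symmetric density), and uses the change of variables $u = (y - Y_i)/h_y$:

```latex
\begin{aligned}
\int y\, \hat{f}(x, y)\, dy
  &= \frac{1}{n h_x} \sum_{i=1}^n K_x\!\left(\frac{x - X_i}{h_x}\right)
     \int y\, \frac{1}{h_y} K_y\!\left(\frac{y - Y_i}{h_y}\right) dy \\
  &= \frac{1}{n h_x} \sum_{i=1}^n K_x\!\left(\frac{x - X_i}{h_x}\right)
     \int (Y_i + h_y u)\, K_y(u)\, du \\
  &= \frac{1}{n h_x} \sum_{i=1}^n Y_i\, K_x\!\left(\frac{x - X_i}{h_x}\right).
\end{aligned}
```

Dividing by $\hat{f}(x)$ from (11.24) cancels the factor $1/(n h_x)$ and leaves the ratio in (11.25). Under these assumptions the bandwidth $h_y$ drops out entirely, so only $h_x$ acts as a smoothing parameter.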


Note that this matches the form of a kernel smoother [see (11.21)].

It is easy to show that the Nadaraya–Watson estimator minimizes

$$\sum_{i=1}^n (Y_i - \beta_0)^2 K_x\left(\frac{x - X_i}{h_x}\right) \qquad (11.26)$$

with respect to $\beta_0$. This is a least squares problem that locally approximates $s(x)$ with a
constant. Naturally, this locally constant model could be replaced with a local
higher-order polynomial model. Fitting a local polynomial using weighted regression, with
weights set according to some kernel function, yields a locally weighted regression
smooth, often simply called a local regression smooth [118, 192, 651]. The $p$th-order
local polynomial regression smoother minimizes the weighted least squares criterion

$$\sum_{i=1}^n \left[Y_i - \beta_0 - \beta_1 (x - X_i) - \cdots - \beta_p (x - X_i)^p\right]^2 K_x\left(\frac{x - X_i}{h_x}\right) \qquad (11.27)$$

and can be fitted using a weighted polynomial regression at each $x$, with weights
determined by the kernel $K_x$ according to the proximity to $x$. This is still a linear
smoother, with a smoothing matrix composed of one row from the hat matrix used in
each weighted polynomial regression.

The least squares criterion may be replaced by other choices. See Section 11.4.1 
for an extension of this technique that relies on a robust fitting method. 
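
A local polynomial smoother can be sketched by solving one weighted least squares problem per evaluation point. The code below is an illustration under our own assumptions: the function name `local_poly_smooth` is hypothetical, a normal kernel is used, and the polynomial is centered at the evaluation point so that the fitted intercept is the smooth there. Centering with $X_i - x$ rather than $x - X_i$ flips the sign of the odd-order coefficients but leaves the intercept unchanged.

```python
import numpy as np

def local_poly_smooth(x, y, h, p=1, x_eval=None):
    """pth-order local polynomial regression with a normal kernel, cf. (11.27)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_eval = x if x_eval is None else np.asarray(x_eval, dtype=float)
    fits = np.empty(len(x_eval))
    for m, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)             # kernel weights
        X = np.vander(x - x0, N=p + 1, increasing=True)     # columns (x_i - x0)^0..p
        sw = np.sqrt(w)                                     # weighted LS via row scaling
        beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        fits[m] = beta[0]                                   # intercept = fit at x0
    return fits
```

With `p=0` this reduces to the Nadaraya–Watson estimator, in line with (11.26); with `p=1` it reproduces exactly linear data, which is what removes the first-order edge bias of the locally constant fit.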


11.2.5 Spline Smoothing 


Perhaps you have found the graphs of smooths presented so far in this chapter to be 
somewhat unsatisfying visually because they are more wiggly than you would have 
drawn by hand. They exhibit small-scale variations that your eye easily attributes to 
random noise rather than to signal. Then smoothing splines may better suit your taste. 

Assume that the data have been sorted in increasing order of the predictor, so
$x_1$ is the smallest predictor value and $x_n$ is the largest. Define

$$Q_\lambda(\hat{s}) = \sum_{i=1}^n \left(Y_i - \hat{s}(x_i)\right)^2 + \lambda \int_{x_1}^{x_n} \hat{s}''(x)^2\, dx, \qquad (11.28)$$

where $\hat{s}''(x)$ is the second derivative of $\hat{s}(x)$. Then the summation constitutes a penalty
for misfit, and the integral is a penalty for wiggliness. The parameter $\lambda$ controls the
relative weighting of these two penalties.

It is an exercise in the calculus of variations to minimize $Q_\lambda(\hat{s})$ over all twice
differentiable functions $\hat{s}$ for fixed $\lambda$. The result is a cubic smoothing spline, $\hat{s}_\lambda(x)$. This
function is a cubic polynomial in each interval $[x_i, x_{i+1}]$ for $i = 1, \ldots, n - 1$, with
these polynomial pieces pasted together twice continuously differentiably at each $x_i$.
Although usually inadvisable in practice, smoothing splines can be defined on ranges 
extending beyond the edges of the data. In this case, the extrapolative portions of the 
smooth are linear. 

It turns out that cubic splines are linear smoothers, so $\hat{\mathbf{s}}_\lambda = \mathbf{S}\mathbf{Y}$. This result is
presented clearly in [322], and efficient computation methods are covered in [144, 
597]. Other useful references about smoothing splines include [164, 188, 280, 649]. 


FIGURE 11.7 Plot of a cubic smoothing spline using $\lambda = 0.066$ chosen by cross-validation
(solid line) and the true underlying curve (dotted line). The horizontal axis is the predictor
and the vertical axis is the response.


The $i$th row of $\mathbf{S}$ consists of weights $S_{i1}, \ldots, S_{in}$, whose relationship to $x_i$ is
sketched in Figure 11.8 (discussed in Section 11.3). Such weights are reminiscent of
kernel smoothing with a kernel that is not always positive, but in this case the kernel 
does not retain the same shape when centered at different points. 


Example 11.6 (Easy Data, Continued) Figure 11.7 shows a spline smooth of the 
data introduced in Example 11.1, using $\lambda = 0.066$ chosen by cross-validation. The
result is a curve very similar to what you might have sketched by hand. 


11.2.5.1 Choice of Penalty Smoothing splines depend on a smoothing parameter
$\lambda$ that relates to neighborhood size less directly than for smoothers we have
discussed previously. We have already noted that $\lambda$ controls the bias-variance
trade-off, with large values of $\lambda$ favoring variance reduction and small values favoring bias
reduction. As $\lambda \to \infty$, $\hat{s}_\lambda$ approaches the least squares line. When $\lambda = 0$, $\hat{s}_\lambda$ is an
interpolating spline that simply connects the data points.

Since smoothing splines are linear smoothers, the span selection methods
discussed in Section 11.2.1 still apply. Calculating $\mathrm{CVRSS}_\lambda(\hat{s}_\lambda)$ via (11.18) requires
the $S_{ii}$, which can be calculated efficiently using the method in [496]. Calculating
$\mathrm{GCVRSS}_\lambda(\hat{s}_\lambda)$ requires $\mathrm{tr}\{\mathbf{S}\}$, which also can be calculated efficiently [145].
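
The two limiting cases can be seen in a simplified discrete analog of (11.28). The sketch below is not a cubic smoothing spline: it swaps in a penalty on squared second differences of the fitted values at equally spaced points (often called a Whittaker-type smoother), chosen here only because it fits in a few lines and shares the same structure. The function name `penalized_smooth` is ours.

```python
import numpy as np

def penalized_smooth(y, lam):
    """Minimize sum_i (y_i - s_i)^2 + lam * sum_i (s_{i+2} - 2 s_{i+1} + s_i)^2.

    A discrete stand-in for (11.28) on equally spaced points. Returns the
    fit and the smoothing matrix S = (I + lam * D'D)^{-1}.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second-difference matrix
    S = np.linalg.inv(np.eye(n) + lam * D.T @ D)
    return S @ y, S
```

The fit is $\mathbf{S}\mathbf{Y}$ with $\mathbf{S}$ free of $\mathbf{Y}$, so this analog is also a linear smoother. With `lam=0` it returns the data unchanged, and for very large `lam` it approaches the least squares line, because straight lines are the only sequences with zero second differences. The trace of the returned matrix falls from $n$ toward 2 as `lam` increases.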


11.3 COMPARISON OF LINEAR SMOOTHERS 


Although the smoothers described so far may seem very different, they all rely on
the principle of local averaging. The fit of each depends on a smoothing matrix $\mathbf{S}$,
whose rows determine the weights used in a local average of the response values.
Comparison of a typical row of $\mathbf{S}$ for different smoothers is a useful approach to
understanding the differences between techniques.

FIGURE 11.8 Equivalent kernels for five different linear smoothing methods for which
$\mathrm{tr}\{\mathbf{S}\} = 7$. The methods are constant-span running mean (CSRM) with symmetric
neighborhoods, running lines (RL) with symmetric neighborhoods, locally weighted regression (LWR),
Gaussian kernel smoothing (K), and cubic smoothing spline (SS). The smoothing weights for
an interior point (indicated by the vertical line) correspond to the 36th row of $\mathbf{S}$. The entire
collection of 105 values of $x_i$ is shown by hashes on the horizontal axis: They are equally
spaced on each side, but twice as dense on the right.

Of course, the weights in a typical row of S depend on the smoothing parameter. 
In general, values of $\lambda$ that favor greater smoothing will produce rows of $\mathbf{S}$ with a
more diffuse allocation of weight, rather than high weights concentrated in just a 
few entries. Therefore, to enable a fair comparison, it is necessary to find a common 
link between the diverse smoothing parameters used by different techniques. The 
common basis for comparison is the number of equivalent degrees of freedom of the 
smooth, which can most simply be defined as df = tr{S} for linear smoothers. Several 
alternative definitions and extensions for nonlinear smoothers are discussed in [322]. 

For fixed degrees of freedom, the entries in a row of $\mathbf{S}$ are functions of the $x_i$,
their spacing, and their proximity to the edge of the data. If the weights in a row of $\mathbf{S}$ are
plotted against the predictor values, we may view the result as an equivalent kernel,
whose weights are analogous to the explicit weighting used in a kernel smoother.
Figure 11.8 compares the equivalent kernel for various smoothers with 7 degrees of
freedom. The kernel is shown for the 36th of 105 ordered predictor values, of which
35 are equally spaced to the left and 69 are equally spaced twice as densely to the right.
Notice that these kernels can be skewed, depending on the spacing of the $x_i$. Also,
kernels need not be everywhere positive. In Figure 11.8, the equivalent kernel for the
smoothing spline assigns negative weight in some regions. Although not shown in
this figure, the shape of the kernels is markedly different for a point near the edge of
the data. For such a point, weights generally increase toward the edge and decrease
away from it.
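
Matching smoothers on degrees of freedom can be sketched for the Gaussian kernel smoother as follows. This is an illustration, not the procedure used to draw Figure 11.8: the function names are ours, and the bisection bounds `lo` and `hi` are assumptions that must bracket the answer. The search relies on $\mathrm{tr}\{\mathbf{S}\}$ decreasing as the bandwidth grows, which holds for this kernel because each diagonal weight is the reciprocal of a sum of kernel values that all increase with $h$.

```python
import numpy as np

def kernel_matrix(x, h):
    """Smoothing matrix of the Gaussian kernel smoother at the observed x."""
    z = (x[:, None] - x[None, :]) / h
    W = np.exp(-0.5 * z**2)
    return W / W.sum(axis=1, keepdims=True)

def bandwidth_for_df(x, target_df, lo=1e-3, hi=1e3, iters=100):
    """Bisect on a log scale for h such that tr{S} is close to target_df."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if np.trace(kernel_matrix(x, mid)) > target_df:
            lo = mid            # too many degrees of freedom: widen the kernel
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Predictor layout patterned on Figure 11.8: 35 points, the 36th point,
# then 69 points spaced twice as densely (the unit spacing is an assumption).
x = np.concatenate([np.arange(36.0), 35.0 + 0.5 * np.arange(1, 70)])
h = bandwidth_for_df(x, 7.0)
row = kernel_matrix(x, h)[35]   # equivalent kernel for the 36th point
```

Plotting `row` against `x` gives the equivalent kernel for that point. Although the Gaussian kernel itself is symmetric, more of the total weight falls to the right of the 36th point simply because the points there are denser.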


11.4 NONLINEAR SMOOTHERS 


Nonlinear smoothers can be much slower to calculate, and in ordinary cases they 
offer little improvement over simpler approaches. However, the simpler methods can 
exhibit very poor performance for some types of data. The loess smoother provides 
improved robustness to outliers that would introduce substantial noise in an ordinary 
smoother. We also examine the supersmoother, which allows the smoothing span to 
vary as best suits the local needs of the smoother. Such smoothing can be very helpful 
when var{Y|x} varies with x. 


11.4.1 Loess 


The loess smoother is a widely used method with good robustness properties [116, 
117]. It is essentially a weighted running-line smoother, except that each local line 
is fitted using a robust method rather than least squares. As a result, the smoother is 
nonlinear. 

Loess is fitted iteratively; let $t$ index the iteration number. To start at $t = 0$, we
let $d_k(x_i)$ denote the distance from $x_i$ to its $k$th nearest neighbor, where $k$ (or $k/n$) is
a smoothing parameter. The kernel used for the local weighting around point $x_i$ is

$$K_i(x) = K\left(\frac{x - x_i}{d_k(x_i)}\right), \qquad (11.29)$$

where

$$K(z) = \begin{cases} \left(1 - |z|^3\right)^3 & \text{for } |z| \le 1, \\ 0 & \text{otherwise} \end{cases} \qquad (11.30)$$

is the tricube kernel.

The estimated parameters of the locally weighted regression for the $i$th point at
iteration $t$ are found by minimizing the weighted sum of squares

$$\sum_{j=1}^n \left(Y_j - \left(\beta_{0,i}^{(t)} + \beta_{1,i}^{(t)} x_j\right)\right)^2 K_i(x_j). \qquad (11.31)$$

We denote these estimates as $\hat{\beta}_{m,i}^{(t)}$ for $m = 0, 1$ and $i = 1, \ldots, n$. Linear, rather than
polynomial, regression is recommended, but the extension to polynomials would
require only a straightforward change to (11.31). Note that the fitted value for the
response variable given by the local regression is
$\hat{Y}_i^{(t)} = \hat{\beta}_{0,i}^{(t)} + \hat{\beta}_{1,i}^{(t)} x_i$. This completes
iteration $t$.

FIGURE 11.9 Plot of a loess smooth using $k = 30$ chosen by cross-validation (solid line)
and the true underlying curve (dotted line).


To prepare for the next iteration, observations are assigned new weights
based on the size of their residual, in order to downweight apparent outliers. If
$e_i^{(t)} = Y_i - \hat{Y}_i^{(t)}$, then define robustness weights as

$$r_i^{(t+1)} = B\left(\frac{e_i^{(t)}}{6 \times \mathrm{median}\,\bigl|e_i^{(t)}\bigr|}\right), \qquad (11.32)$$

where $B(z)$ is the biweight kernel given by

$$B(z) = \begin{cases} \left(1 - z^2\right)^2 & \text{for } |z| \le 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (11.33)$$

Then the weights $K_i(x_j)$ in (11.31) are replaced with $r_j^{(t+1)} K_i(x_j)$, and new locally
weighted fits are obtained. The resulting estimates for each $i$ provide $\hat{Y}_i^{(t+1)}$. The
process stops after $t = 3$ by default [116, 117].
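
The iteration can be sketched compactly in Python. This is a simplified illustration, not the reference loess implementation: the function names are ours, the predictor values are assumed distinct so that $d_k(x_i) > 0$, the stopping rule is read as an initial fit followed by three reweighted fits, and the loop exits early if the median absolute residual is exactly zero.

```python
import numpy as np

def tricube(z):
    z = np.abs(z)
    return np.where(z < 1, (1 - z**3) ** 3, 0.0)        # (11.30)

def biweight(z):
    return np.where(np.abs(z) < 1, (1 - z**2) ** 2, 0.0)  # (11.33)

def loess(x, y, k, iterations=3):
    """Locally weighted lines with robustness iterations, cf. (11.29)-(11.33)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dist = np.abs(x[:, None] - x[None, :])
    d_k = np.sort(dist, axis=1)[:, k]          # column 0 is x_i itself
    K = tricube(dist / d_k[:, None])           # K[i, j] = K_i(x_j)
    robust = np.ones(n)                        # no downweighting at t = 0
    X = np.column_stack([np.ones(n), x])
    fitted = np.empty(n)
    for t in range(iterations + 1):
        for i in range(n):
            sw = np.sqrt(robust * K[i])        # weighted LS via row scaling
            beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
            fitted[i] = beta[0] + beta[1] * x[i]
        resid = y - fitted
        scale = 6.0 * np.median(np.abs(resid))
        if scale == 0:
            break                              # perfect fit: nothing to reweight
        robust = biweight(resid / scale)       # (11.32)
    return fitted
```

Because the robustness weights depend on the residuals, and hence on the responses, the final fit is no longer a fixed linear combination of the $Y_i$; this is what makes loess a nonlinear smoother.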


Example 11.7 (Easy Data, Continued) Figure 11.9 shows a loess smooth of the 
data introduced in Example 11.1, using k = 30 chosen by cross-validation. The results 
are very similar to the running-line smooth. 

Figure 11.10 shows the effect of outliers. The dotted line in each panel is 
the original smooth for loess and running lines; the solid lines are the result when 
three additional data points at (1, —8) are inserted in the dataset. The spans for each 
smoother were left unchanged. Loess was so robust to these outliers that the two 
curves are nearly superimposed. The running-line smoother shows greater sensitivity 
to the outliers.

FIGURE 11.10 Plot of running-line smooths (left) using $k = 23$, and loess smooths (right)
using $k = 30$. In each panel, the dotted line is the smooth of the original data, and the solid line
is the smooth after inserting in the dataset three new outlier points at $(1, -8)$.


11.4.2 Supersmoother 


All of the previous methods employ a fixed span. There are cases, however, where a 
variable span may be more appropriate. 


Example 11.8 (Difficult Data) Consider the curve and data shown in Figure 11.11. 
These data are available from the website for this book. Suppose that the true condi- 
tional mean function for these data is given by the curve; thus the goal of smoothing 
is to estimate the curve using the observed data. The curve is very wiggly on the right 
side of the figure, but these wiggles could reasonably be detected by a smoother with 
a suitably small span because the variability in the data is very low. On the left, the 
curve is very smooth, but the data have much larger variance. Therefore, a large span 
would be needed in this region to smooth the noisy data adequately. Thus a small 
span is needed to minimize bias in one region, and a large span is needed to control 
variance in another region. The supersmoother [205, 208] was designed for this sort 
of problem. 


The supersmoothing approach begins with the calculation of m different
smooths, which we denote $\hat s_1(x), \ldots, \hat s_m(x)$, using different spans, say $h_1, \ldots, h_m$.
For m = 3, spans of $h_1 = 0.05n$, $h_2 = 0.2n$, and $h_3 = 0.5n$ are recommended. Each
smooth should be computed over the full range of the data. For simplicity, use the
running-line smoother to generate the $\hat s_j(x)$ for j = 1, 2, and 3. Figure 11.12 shows
the three smooths.

FIGURE 11.11 These bivariate data with nonconstant variance and wiggles with changing
frequency and amplitude would be fitted terribly by most fixed-span smoothers. The true E{Y|x}
is shown with the solid line.

FIGURE 11.12 The three preliminary fixed-span smooths employed by the supersmoother.
The spans are 0.05n (dashed), 0.2n (dotted), and 0.5n (solid). The data points have been faded
to enable a clearer view of the smooths.

FIGURE 11.13 The $\hat p(h_j, x_i)$ for j = 1 (dashed), 2 (dotted), and 3 (solid). For each j, the
curve is a smooth of the absolute cross-validated residuals.


Next, define $p(h_j, x)$ to be a measure of performance of the jth smooth at
point x, for j = 1, ..., m. Ideally, we would like to assess the performance at point
$x_i$ according to $E\{ g(Y - \hat s_j^{(-i)}(x_i)) \mid X = x_i \}$, where g is a symmetric function that
penalizes large deviations, and $\hat s_j^{(-i)}(x_i)$ is the jth smooth at $x_i$ estimated using the
cross-validation dataset that omits $x_i$. This expected value is of course unknown, so,
following the local-averaging paradigm, we estimate it as

$$\hat p(h_j, x_i) = \hat s^*\left( g\left( Y_i - \hat s_j^{(-i)}(x_i) \right) \right), \qquad (11.34)$$

where $\hat s^*$ is some fixed-span smoother. For the implementation suggested in [205],
$\hat s^* = \hat s_2$ and g(z) = |z|. Figure 11.13 shows the smoothed absolute cross-validated
residuals $|Y_i - \hat s_j^{(-i)}(x_i)|$ for the three different smooths. The curves in this figure
represent $\hat p(h_j, x_i)$ for j = 1, 2, and 3. The data used in each smooth originate as
residuals from alternative smooths using spans of 0.05n (dashed), 0.2n (dotted), and 0.5n
(solid), but each set of absolute residuals is smoothed with a span of 0.2n to generate
the curves shown.

At each $x_i$, the performances of the three smooths can be assessed using $\hat p(h_j, x_i)$
for j = 1, 2, and 3. Denote by $\hat h_i$ the best of these spans at $x_i$, that is, the particular
span among $h_1$, $h_2$, and $h_3$ that provides the lowest $\hat p(h_j, x_i)$. Figure 11.14 plots $\hat h_i$
against $x_i$ for our example. The best span can vary abruptly even for adjacent $x_i$, so
next the data in Figure 11.14 are passed through a fixed-span smoother (say, $\hat s_2$) to
estimate the optimal span as a function of x. Denote this smooth as $\hat h(x)$. Figure 11.14
also shows $\hat h(x)$.

Now we have the original data and a notion of the best span to use for any given
x: namely $\hat h(x)$. What remains is to create a final, overall smooth. Among several
strategies that might be employed at this point, [205] recommends setting $\hat s(x_i)$ equal
to a linear interpolation between $\hat s_{h^-(x_i)}(x_i)$ and $\hat s_{h^+(x_i)}(x_i)$, where among the m fixed
spans tried, $h^-(x_i)$ is the largest span less than $\hat h(x_i)$ and $h^+(x_i)$ is the smallest span
that exceeds $\hat h(x_i)$. Thus,

$$\hat s(x_i) = \frac{\hat h(x_i) - h^-(x_i)}{h^+(x_i) - h^-(x_i)} \, \hat s_{h^+(x_i)}(x_i) + \frac{h^+(x_i) - \hat h(x_i)}{h^+(x_i) - h^-(x_i)} \, \hat s_{h^-(x_i)}(x_i). \qquad (11.35)$$

FIGURE 11.14 Estimate of the optimal span as a function of x. The points correspond to
$(x_i, \hat h_i)$. A smooth of these points, namely $\hat h(x)$, is shown with the curve.

Figure 11.15 shows the final result. The supersmoother adjusted the span wisely,
based on the local variability of the data. In comparison, the spline smooth shown in
this figure undersmoothed the left side and oversmoothed the right side, for a fixed $\lambda$
chosen by cross-validation.
Although the supersmoother is a nonlinear smoother, it is very fast compared 
to most other nonlinear smoothers, including loess. 
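
The final interpolation step (11.35) is simple enough to sketch directly. The following Python fragment is a minimal illustration, assuming the m fixed-span smooths have already been evaluated at every $x_i$ and stored row by row; the function name `interpolate_smooths` and its arguments are hypothetical. Two details are assumptions rather than part of the description above: smoothed spans falling outside $[h_1, h_m]$ are clipped to that range, and a smoothed span that coincides exactly with one of the fixed spans simply returns that smooth.

```python
import numpy as np

def interpolate_smooths(spans, smooths, h_hat):
    """Combine fixed-span smooths by the linear interpolation in (11.35).

    spans   : increasing array of the m fixed spans h_1 < ... < h_m
    smooths : array of shape (m, n); row j holds the jth smooth at x_1, ..., x_n
    h_hat   : array of length n with the smoothed optimal span at each x_i
    """
    spans = np.asarray(spans, dtype=float)
    smooths = np.asarray(smooths, dtype=float)
    h = np.clip(np.asarray(h_hat, dtype=float), spans[0], spans[-1])
    # index of the bracketing spans h^-(x_i) and h^+(x_i)
    hi = np.clip(np.searchsorted(spans, h, side="right"), 1, len(spans) - 1)
    lo = hi - 1
    w = (h - spans[lo]) / (spans[hi] - spans[lo])   # weight on the larger span
    idx = np.arange(smooths.shape[1])
    return w * smooths[hi, idx] + (1 - w) * smooths[lo, idx]
```

For example, with spans (5, 20, 50) and a smoothed span of 12.5 at some point, the result there is the average of the first two smooths.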


11.5 CONFIDENCE BANDS 


Producing reliable confidence bands for smooths is not straightforward. Intuitively, 
what is desired is an image that portrays the range and variety of smooth curves 
that might plausibly be obtained from data like what we observed. Bootstrapping 
(Chapter 9) provides a method for avoiding parametric assumptions, but it does not 
help clarify exactly what sort of region should be graphed. 

FIGURE 11.15 Supersmoother fit (solid line). A spline smoother fit (with $\lambda$ chosen by
cross-validation) is also shown (dotted line).

Consider first the notion of a pointwise confidence band. Bootstrapping the
residuals would proceed as follows. Let e denote the vector of residuals [so e =
(I - S)Y for a linear smoother]. Sample the elements of e with replacement to generate
bootstrapped residuals $e^*$. Add these to the fitted values to obtain bootstrapped
responses $Y^* = \hat Y + e^*$. Smooth $Y^*$ over x to generate a bootstrapped fitted smooth,
$\hat s^*$. Start anew and repeat the bootstrapping many times. Then, for each x in the dataset,
a bootstrap confidence interval for $\hat s(x)$ can be generated using the percentile method
(Section 9.3.1) by deleting the few largest and smallest bootstrap fits at that point. If
the upper bounds of these pointwise confidence intervals are connected for each x,
the result is a band lying above $\hat s(x)$. A plot showing this upper band along with the
corresponding lower band provides a visually appealing confidence region.
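
As a rough illustration of the residual bootstrap just described, the sketch below builds a pointwise percentile band for a linear smoother with smoothing matrix S. It is only a sketch under assumptions: the data are taken to be sorted by predictor, a truncated symmetric running-mean smoother stands in for whatever linear smoother is actually used, and the names `running_mean_matrix` and `pointwise_band` are hypothetical.

```python
import numpy as np

def running_mean_matrix(n, k):
    """Smoothing matrix of a symmetric running mean with span 2k + 1,
    truncated at the edges (a stand-in linear smoother)."""
    S = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        S[i, lo:hi] = 1.0 / (hi - lo)
    return S

def pointwise_band(S, y, n_boot=1000, level=0.95, seed=0):
    """Residual bootstrap with percentile limits computed separately at each x_i."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    fitted = S @ y
    resid = y - fitted                     # e = (I - S) Y
    boot = np.empty((n_boot, n))
    for b in range(n_boot):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        boot[b] = S @ y_star               # bootstrapped smooth
    a = (1 - level) / 2
    lower, upper = np.quantile(boot, [a, 1 - a], axis=0)
    return fitted, lower, upper, boot
```

The returned `boot` array holds every bootstrapped smooth, which is what the later adjustments for joint coverage operate on.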

Although this approach is appealing, it can be quite misleading. First, the confi- 
dence bands are composed of pointwise confidence intervals with no adjustment made 
for simultaneous inference. To correct the joint coverage probability to 95%, each of 
the individual intervals would need to represent much more than 95% confidence. 
The result would be a substantial widening of the pointwise bands. 

Second, the pointwise confidence bands are not informative about features 
shared by all smooths supported by the data. For example, all the smooths could 
have an important bend at the same point, but the pointwise confidence region would 
not necessarily enforce this. It would be possible to sketch smooth curves lying 
entirely within the pointwise region that do not have such a bend, or even ones that 
have a reverse bend at that point. Similarly, suppose that all the smooths show gener- 
ally the same curved shape and a linear fit is significantly inferior. If the confidence 
bands are wide or if the curve is not too severe, it will be possible to sketch a linear 
fit that lies entirely within the bands. In this case, the pointwise bands fail to convey 
an important aspect of the inference: that a linear fit should be rejected. 


Example 11.9 (Confidence Bands) The shortcomings of pointwise confidence
bands are illustrated in Figure 11.16, for some data for which the true conditional
mean function is $E\{Y|x\} = x^2$. The smoothing span for a running-line smoother
was chosen by cross-validation, and a pointwise 95% confidence band is shown by
the shaded region. Note that the band widens appropriately near the edges of the
data to reflect the increased uncertainty of smoothing in these regions with fewer
neighboring observations. Unfortunately, the null model E{Y|x} = 0 lies entirely
within the pointwise confidence band. In Example 11.11 we introduce an alternative
method that does not support the null model.

FIGURE 11.16 Running-line smooth of some data for which $E\{Y|x\} = x^2$, with span chosen
by cross-validation. The shaded region represents a pointwise 95% confidence band as described
in the text. Note that the line Y = 0 is contained entirely within the band. The dotted lines show
the result of the expansion approach described in Example 11.10.


The failure of pointwise confidence bands to capture the correct joint coverage
probability can be remedied in several ways. One straightforward approach begins
by writing the ordinary pointwise confidence bands as $(\hat s(x) - L(x), \hat s(x) + U(x))$,
where L(x) and U(x) denote how far the lower and upper pointwise confidence bounds
deviate from $\hat s(x)$ at the point x. Then the confidence bands can be expanded outwards
by finding the smallest $\omega$ for which the bands given by $(\hat s(x) - \omega L(x), \hat s(x) + \omega U(x))$
contain at least $(1 - \alpha)100\%$ of the bootstrap curves in their entirety, where
$(1 - \alpha)100\%$ is the desired confidence level. Alternatively, $\hat s(x)$ can be replaced
by the pointwise median bootstrap curve; for hypothesis testing use the pointwise
median null band. A second, rough approach is simply to additively shift the pointwise
bounds outwards until the desired percentage of bootstrap curves are entirely
contained within.
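
The expansion factor can be found without any search. For each bootstrap curve, one can compute the smallest multiplier that would contain that curve entirely; the required factor is then an order statistic of those per-curve multipliers. The sketch below illustrates this idea, assuming L(x) and U(x) are strictly positive at every point; the function name `expansion_factor` and its arguments are hypothetical.

```python
import numpy as np

def expansion_factor(center, lower, upper, boot, level=0.95):
    """Smallest multiplier w such that (center - w L, center + w U) contains
    at least a fraction `level` of the bootstrap curves in their entirety.

    center, lower, upper : arrays of length n (smooth and pointwise bounds)
    boot                 : array of shape (B, n) of bootstrap smooths
    """
    L = center - lower                       # assumed strictly positive
    U = upper - center                       # assumed strictly positive
    dev = boot - center
    # multiplier needed at each point, then the worst point along each curve
    need = np.where(dev >= 0, dev / U, -dev / L)
    per_curve = need.max(axis=1)
    m = int(np.ceil(level * len(per_curve)))  # number of curves to contain
    return np.sort(per_curve)[m - 1]
```

A value above 1 indicates how much the pointwise band must widen to reach the stated joint coverage among the bootstrap curves.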


Example 11.10 (Confidence Bands, Continued) Applying this approach to
Example 11.9, we find $\omega = 1.61$. The resulting bands are shown as dotted lines in
Figure 11.16. This method improves the joint coverage probability.


The failure of pointwise confidence bands to represent accurately the shape
of the bootstrap confidence set cannot be blamed on the pointwise nature of the
bands; rather it is caused by the attempt to reduce an n-dimensional confidence set to a
two-dimensional picture. Even if a band with correct joint coverage were used, the
same problems would remain. For that reason, it may be more reasonable to superimpose
a number of smooth curves that are known to belong to the joint confidence
set, rather than trying to plot the boundaries of the set itself. With this in mind, we
now describe a second bootstrapping approach suitable for linear smoothers.

[Figure 11.17: horizontal axis Predictor, vertical axis Response.]
FIGURE 11.17 Twenty bootstrap smooths of the data in Figure 11.16 for which $V^*$ was
within the central 95% region of its bootstrap distribution; see Example 11.11.

Assume the response variable has constant variance. Among the estimators of
this variance, Hastie and Tibshirani [322] recommend

$$\hat\sigma^2 = \frac{\mathrm{RSS}_\lambda(\hat s_\lambda)}{n - 2\,\mathrm{tr}\{S\} + \mathrm{tr}\{S S^T\}}, \qquad (11.36)$$

where $\hat s_\lambda$ and $\mathrm{RSS}_\lambda(\hat s_\lambda)$ represent the estimated smooth using span $\lambda$ and the resulting
residual squared error. The quantity

$$V = (\hat s_\lambda - s)^T \left( \hat\sigma^2 S S^T \right)^{-1} (\hat s_\lambda - s) \qquad (11.37)$$

is approximately pivotal, so its distribution is roughly independent of the true
underlying curve. Bootstrap the residuals as above, each time computing the vector of
bootstrap fits, $\hat s_\lambda^*$, and the corresponding value

$$V^* = (\hat s_\lambda^* - \hat s_\lambda)^T \left( \hat\sigma^{*2} S S^T \right)^{-1} (\hat s_\lambda^* - \hat s_\lambda). \qquad (11.38)$$

Use the collection of $V^*$ values to construct the empirical distribution of $V^*$. Eliminate
those bootstrap fits whose $V^*$ values are in the upper tail of this empirical distribution.
Graph the remaining smooths (or a subset of them) superimposed. This provides a
useful picture of the uncertainty of a smooth.

[Figure 11.18: two panels, each plotting $X_2$ against $X_1$.]
FIGURE 11.18 The left panel shows data scattered around the time-parameterized curve
given by $(x(\tau), y(\tau)) = ((1 - \cos\tau)\cos\tau, (1 - \cos\tau)\sin\tau)$ for $\tau \in [0, 3\pi/2]$, which is shown
by the solid line. The dotted line shows the result of a fifth-order polynomial regression of $X_2$
on $X_1$, and the dashed line shows the result of a fifth-order polynomial regression of $X_1$ on
$X_2$. The right panel shows a principal curve smooth (solid) for these data, along with the true
curve (dotted). These are nearly superimposed.


Example 11.11 (Confidence Bands, Continued) Applying the above method to
the data described in Example 11.9 with a running-line smoother yields Figure 11.17.
Roughly the same pointwise spread is indicated by this figure as is shown in the
pointwise bands of Figure 11.16, but Figure 11.17 confirms that the smooth is curved
like $y = x^2$. In fact, from 1000 bootstrap iterations, only three smooths resembled
functions with a nonpositive second derivative. Thus, this bootstrap approach strongly
rejects the null relationship Y = 0, while the pointwise confidence bands fail to rule
it out.
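
The bootstrap based on the approximately pivotal quantity in (11.36) through (11.38) might be sketched as follows for a linear smoother with smoothing matrix S. Several details here are assumptions: the function name `bootstrap_v_filter` is hypothetical, the data are assumed to be ordered as the rows of S expect, and a Moore-Penrose pseudo-inverse is substituted for the inverse of $S S^T$ because that matrix can be numerically singular for some smoothers.

```python
import numpy as np

def bootstrap_v_filter(S, y, n_boot=1000, level=0.95, seed=0):
    """Residual bootstrap that keeps the smooths whose V* in (11.38)
    falls below the `level` quantile of the bootstrap V* values."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    fitted = S @ y
    resid = y - fitted
    denom = n - 2 * np.trace(S) + np.trace(S @ S.T)   # denominator of (11.36)
    SSt_inv = np.linalg.pinv(S @ S.T)                 # pseudo-inverse (assumption)
    boot = np.empty((n_boot, n))
    V = np.empty(n_boot)
    for b in range(n_boot):
        y_star = fitted + rng.choice(resid, size=n, replace=True)
        s_star = S @ y_star
        sigma2_star = np.sum((y_star - s_star) ** 2) / denom
        d = s_star - fitted
        V[b] = d @ SSt_inv @ d / sigma2_star          # V* of (11.38)
        boot[b] = s_star
    keep = V <= np.quantile(V, level)                 # drop the upper tail
    return boot[keep], V
```

Plotting a handful of the retained rows on top of the data gives a picture in the spirit of Figure 11.17.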


A variety of other bootstrapping and parametric approaches for assessing the 
uncertainty of results from smoothers are given in [192, 309, 322, 436]. 


11.6 GENERAL BIVARIATE DATA 


For general bivariate data, there is no clear distinction of predictor and response
variables, even though the two variables may exhibit a strong relationship. It is therefore
more reasonable to label the variables as $X_1$ and $X_2$. As an example of such data,
consider the two variables whose scatterplot is shown in Figure 11.18. For this
example, the curve to be estimated corresponds to a curvilinear ridgetop of the joint
distribution of $X_1$ and $X_2$.

Arbitrary designation of one variable as predictor and the other as response
is counterproductive for such problems. For example, the left panel of Figure 11.18
shows the two fits obtained by ordinary fifth-order polynomial regression. Each of
these lines is fitted by minimizing a set of residuals that measure the distances between
points and the fitted curve, parallel to the axis of response. In one case $X_1$ was treated
as the response, and in the other case $X_2$ was. Very different answers result, and in
this case they are both terrible approximations to the true relationship.

The right panel of Figure 11.18 shows another curve fit to these data. Here, the 
curve was chosen to minimize the orthogonal distances between the points and the 
curve, without designation of either variable as the response. This approach corre- 
sponds to the local-averaging notion that points in any local neighborhood should fall 
near the curve. One approach to formalizing this notion is presented in Section 12.2.1, 
where we discuss the principal curves method for smoothing general p-dimensional 
data when there is no clear distinction of predictor and response. Setting p = 2 ac- 
commodates the bivariate case shown here. 


PROBLEMS 


11.1. Generate 100 random points from the following model: $X \sim \mathrm{Unif}(0, \pi)$ and
$Y = g(X) + \epsilon$ with independent $\epsilon | x \sim N(0, g(x)^2/64)$, where $g(x) = 1 + \sin\{x^2\}/x^2$.
Smooth your data with a constant-span (symmetric nearest neighbor) running-mean
smoother. Select a span of 2k + 1 for $1 \le k \le 11$ chosen by cross-validation. Does a
running-median smooth with the same span seem very different?


11.2. Use the data from Problem 11.1 to investigate kernel smoothers as described below: 
a. Smooth the data using a normal kernel smoother. Select the optimal standard 
deviation of the kernel using cross-validation. 


b. Define the symmetric triangle distribution as

$$f(x; \mu, h) = \begin{cases} 0 & \text{if } |x - \mu| > h, \\ (x - \mu + h)/h^2 & \text{if } \mu - h \le x < \mu, \\ (\mu + h - x)/h^2 & \text{if } \mu \le x \le \mu + h. \end{cases}$$

The standard deviation of this distribution is $h/\sqrt{6}$. Smooth the data using a symmetric
triangle kernel smoother. Use cross-validation to search the same set of
standard deviations used in the first case, and select the optimum.


c. Let

$$f(x; \mu, h) = c \left( 1 + \cos\{ 2\pi z \log\{ |z| + 1 \} \} \right) \exp\left\{ -\tfrac{1}{2} z^2 \right\},$$

where $z = (x - \mu)/h$ and c is a constant. Plot this density function. The standard
deviation of this density is about 0.90h. Smooth the data using a kernel smoother
with this kernel. Use cross-validation to search the same set of standard deviations
used previously, and select the optimum.


d. Compare the smooths produced using the three kernels. Compare their CVRSS 
values at the optimal spans. Compare the optimal spans themselves. For kernel 
smoothers, what can be said about the relative importance of the kernel and the 
span? 


11.3. Use the data from Problem 11.1 to investigate running lines and running polynomial
smoothers as described below:

a. Smooth the data using a running-line smoother with symmetric nearest neighbors.
Select a span of 2k + 1 for $1 \le k \le 11$ chosen by cross-validation.

b. Repeat this process for running local polynomial smoothers of degree 3 and 5;
each time choose the optimal span using cross-validation over a suitable range
for k. (Hints: You may need to orthogonalize polynomial terms; also reduce the
polynomial degree as necessary for large spans near the edges of the data.)

c. Comment on the quality and characteristics of the three smooths (local linear,
cubic, and quintic).

d. Does there appear to be a relationship between polynomial degree and optimal
span?

e. Comment on the three plots of CVRSS.


11.4. The book website provides data on the temperature-pressure profile of the Martian
atmosphere, as measured by the Mars Global Surveyor spacecraft in 2003 using a
radio occultation technique [638]. Temperatures generally cool with increasing
planetocentric radius (altitude).

a. Smooth temperature as a function of radius using a smoothing spline, loess, and at
least one other technique. Justify the choice of span for each procedure.

b. The dataset also includes standard errors for the temperature measurements. Apply
reasonable weighting schemes to produce weighted smooths using each smoother
considered in part (a). Compare these results with the previous results. Discuss.

c. Construct confidence bands for your smooths. Discuss.

d. These data originate from seven separate orbits of the spacecraft. These orbits pass
over somewhat different regions of Mars. A more complete dataset including orbit
number, atmospheric pressure, longitude, latitude, and other variables is available
in the file mars-all.dat at the book website. Introductory students may smooth
some other interesting pairs of variables. Advanced students may seek to improve
the previous analyses, for example, by adjusting for orbit number or longitude and
latitude. Such an analysis might include both parametric and nonparametric model
components.


11.5. Reproduce Figure 11.8. (Hint: The kernel for a spline smoother can be
reverse-engineered from the fit produced by any software package, using a suitable vector
of response data.)

a. Create a graph analogous to Figure 11.8 for smoothing at the second smallest
predictor value. Compare this with the first graph.

b. Graphically compare the equivalent kernels for cubic smoothing splines for different
$x_i$ and $\lambda$.


11.6. Figure 11.19 shows the pressure difference between two sensors on a steel plate
exposed to a powerful air blast [342]. There are 161 observations during a period
just before and after the blast. The noise in Figure 11.19 is attributable to inadequate
temporal resolution and to error in the sensors and recording equipment; the underlying
physical shock waves that generate these data are smooth. These data are available
from the website for this book.

FIGURE 11.19 Data on air blast pressure difference for Problem 11.6.

a. Construct a running-line smooth of these data, with span chosen by eye.

b. Make a plot of $\mathrm{CVRSS}_k(\hat s_k)$ versus k for $k \in \{3, 5, 7, 11, 15, 20, 30, 50\}$. Comment.


c. Produce the most appealing smooth you can for these data, using any smoother 
and span you wish. Why do you like it? 


d. Comment on difficulties of smoothing and span selection for these data. 


11.7. Using the data from Problem 11.6 and your favorite linear smoothing method for 
these data, construct confidence bands for the smooth using each method described 
in Section 11.5. Discuss. (Using a spline smoother is particularly interesting.) 


CHAPTER 12


MULTIVARIATE SMOOTHING 


12.1 PREDICTOR-RESPONSE DATA 


Multivariate predictor-response smoothing methods fit smooth surfaces to
observations $(x_i, y_i)$, where $x_i$ is a vector of p predictors and $y_i$ is the corresponding
response value. The $y_1, \ldots, y_n$ values are viewed as observations of the random
variables $Y_1, \ldots, Y_n$, where the distribution of $Y_i$ depends on the ith vector of predictor
variables.


Many of the bivariate smoothing methods discussed in Chapter 11 can be gen- 
eralized to the case of several predictors. Running lines can be replaced by running 
planes. Univariate kernels can be replaced by multivariate kernels. One generalization 
of spline smoothing is thin plate splines [280, 451]. In addition to the significant com- 
plexities of actually implementing some of these approaches, there is a fundamental 
change in the nature of the smoothing problem when using more than one predictor. 

The curse of dimensionality is that high-dimensional space is vast, and points
have few near neighbors. This same problem was discussed in Section 10.4.1 as it
applied to multivariate density estimation. Consider a unit sphere in p dimensions with
volume $\pi^{p/2} / \Gamma(p/2 + 1)$. Suppose that several p-dimensional predictor points are
distributed uniformly within the ball of radius 4. In one dimension, 25% of predictors
are expected within the unit ball; hence unit balls might be reasonable neighborhoods
for smoothing. Table 12.1 shows that this proportion vanishes rapidly as p increases.
In order to retain 25% of points in a neighborhood when the full set of points lies in
a ball of radius 4, the neighborhood ball would need to have radius 3.73 if p = 20.
Thus, the concept of local neighborhoods is effectively lost.
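
The figures quoted in this paragraph follow directly from the volume formula, since the ratio of a unit ball to a ball of radius 4 is $(1/4)^p$. The short Python check below is offered only as an illustration; the function names `ball_volume`, `volume_ratio`, and `radius_for_fraction` are hypothetical.

```python
import math

def ball_volume(p, radius=1.0):
    """Volume of a p-dimensional ball: pi^(p/2) / Gamma(p/2 + 1) * radius^p."""
    return math.pi ** (p / 2) / math.gamma(p / 2 + 1) * radius ** p

def volume_ratio(p, outer_radius=4.0):
    """Share of a ball of radius `outer_radius` occupied by the unit ball."""
    return ball_volume(p) / ball_volume(p, outer_radius)

def radius_for_fraction(p, fraction=0.25, outer_radius=4.0):
    """Radius of the concentric ball holding `fraction` of the outer ball's volume."""
    return outer_radius * fraction ** (1.0 / p)

if __name__ == "__main__":
    for p in (1, 2, 3, 4, 5, 10, 20, 100):
        print(p, f"{volume_ratio(p):.2g}")
    print("radius needed for p = 20:", round(radius_for_fraction(20), 2))
```

Because the neighborhood radius needed to hold a fixed share of the points is the outer radius times that share raised to the power 1/p, it approaches the outer radius as p grows.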

The curse of dimensionality raises concerns about the effectiveness of 
smoothers for multivariate data. Effective local averaging will require a large number 
of points in each neighborhood, but to find such points, the neighborhoods must stretch 
over most of predictor space. A variety of effective multivariate surface smoothing 
methods are described in [322, 323, 573]. 

There is a rich set of smoothing methods developed for geostatistics and spatial
statistics that are suitable for two and three dimensions. In particular, kriging methods
offer a more principled foundation for inference than many of the generic smoothers
considered here. We do not consider such methods further, but refer readers to books
on spatial statistics such as [128, 291].

TABLE 12.1 Ratio of volume of unit sphere in p dimensions to volume of
sphere with radius 4.

    p      Ratio
    1      0.25
    2      0.063
    3      0.016
    4      0.0039
    5      0.00098
    10     9.5 x 10^-7
    20     9.1 x 10^-13
    100    6.2 x 10^-61


12.1.1 Additive Models 


Simple linear regression is based on the model $E\{Y|x\} = \beta_0 + \beta_1 x$. Nonparametric
smoothing of bivariate predictor-response data generalizes this to $E\{Y|x\} = s(x)$
for a smooth function s. Now we seek to extend the analogy to the case with p
predictors. Multiple regression uses the model $E\{Y|x\} = \beta_0 + \sum_{k=1}^{p} \beta_k x_k$, where
$x = (x_1, \ldots, x_p)^T$. The generalization for smoothing is the additive model

$$E\{Y|x\} = \alpha + \sum_{k=1}^{p} s_k(x_k), \qquad (12.1)$$

where $s_k$ is a smooth function of the kth predictor variable. Thus, the overall model
is composed of univariate effects whose influence on the mean response is additive.
Fitting such a model relies on the relationship

$$s_k(x_k) = E\left\{ Y - \alpha - \sum_{j \ne k} s_j(x_j) \,\middle|\, x \right\}, \qquad (12.2)$$

where $x_k$ is the kth component in x. Suppose that we wished to estimate $s_k$ at $x_k^*$,
and that many replicate values of the kth predictor were observed at exactly this $x_k^*$.
Suppose further that all the $s_j$ ($j \ne k$) were known except $s_k$. Then the expected
value on the right side of (12.2) could be estimated as the mean of the values of
$Y_i - \alpha - \sum_{j \ne k} s_j(x_{ij})$ corresponding to indices i for which the ith observation of
the kth predictor satisfies $x_{ik} = x_k^*$. For actual data, however, there will likely be no
such replicates. This problem can be overcome by smoothing: The average is taken
over points whose kth coordinate is in a neighborhood of $x_k^*$. A second problem,
that none of the $s_j$ are actually known, is overcome by iteratively cycling through
smoothing steps based on isolations like (12.2), updating $s_k$ using the best current
guesses for all $s_j$ for $j \ne k$.
The iterative strategy is called the backfitting algorithm. Let $Y = (Y_1, \ldots, Y_n)^T$,
and for each k, let $\hat s_k^{(t)}$ denote the vector of estimated values of $\hat s_k(x_{ik})$ at iteration t for
i = 1, ..., n. These n-vectors of estimated smooths at each observation are updated
as follows:

1. Let $\hat\alpha$ be the n-vector $(\bar Y, \ldots, \bar Y)^T$. Some other generalized average of the
response values may replace the sample mean $\bar Y$. Let t = 0, where t indexes the
iteration number.

2. Let $\hat s_k^{(0)}$ represent initial guesses for coordinatewise smooths evaluated at the
observed data. A reasonable initial guess is to let $\hat s_k^{(0)} = (\hat\beta_k x_{1k}, \ldots, \hat\beta_k x_{nk})^T$
for k = 1, ..., p, where the $\hat\beta_k$ are the linear regression coefficients found when
Y is regressed on the predictors.

3. For k = 1, ..., p in turn, let

$$\hat s_k^{(t+1)} = \mathrm{smooth}_k(r_k), \qquad (12.3)$$

where

$$r_k = Y - \hat\alpha - \sum_{j<k} \hat s_j^{(t+1)} - \sum_{j>k} \hat s_j^{(t)} \qquad (12.4)$$

and $\mathrm{smooth}_k(r_k)$ denotes the vector obtained by smoothing the elements of
$r_k$ against the kth coordinate values of predictors, namely $x_{1k}, \ldots, x_{nk}$, and
evaluating the smooth at each $x_{ik}$. The smoothing technique used for the kth
smooth may vary with k.

4. Increment t and go to step 3.

The algorithm can be stopped when none of the $\hat s_k^{(t)}$ change very much, perhaps when

$$\frac{\sum_{k=1}^{p} \left( \hat s_k^{(t+1)} - \hat s_k^{(t)} \right)^T \left( \hat s_k^{(t+1)} - \hat s_k^{(t)} \right)}{\sum_{k=1}^{p} \left( \hat s_k^{(t)} \right)^T \hat s_k^{(t)}}$$

is very small.
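
The four steps above can be sketched compactly in Python. This is a minimal illustration rather than a reference implementation: a symmetric nearest-neighbor running mean is assumed as the smoother for every coordinate, the initial linear fit uses centered predictors so that the intercept stays in $\hat\alpha$ (a small departure from step 2 as written), and the names `running_mean_smooth` and `backfit` are hypothetical.

```python
import numpy as np

def running_mean_smooth(x, r, k=5):
    """Smooth r against x with a running mean of span 2k + 1 (truncated at the
    edges) and evaluate the smooth at each observed x."""
    order = np.argsort(x)
    r_sorted = r[order]
    n = len(x)
    out = np.empty(n)
    for pos in range(n):
        lo, hi = max(0, pos - k), min(n, pos + k + 1)
        out[order[pos]] = r_sorted[lo:hi].mean()
    return out

def backfit(X, y, smooth=running_mean_smooth, tol=1e-8, max_iter=100):
    """Backfitting for an additive model; column k of s holds the kth smooth."""
    n, p = X.shape
    alpha = y.mean()                                   # step 1
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, y - alpha, rcond=None)
    s = Xc * beta                                      # step 2: linear start
    for _ in range(max_iter):
        s_old = s.copy()
        for k in range(p):                             # step 3
            # partial residual (12.4): updated smooths for j < k, old for j > k
            r_k = y - alpha - s.sum(axis=1) + s[:, k]
            s[:, k] = smooth(X[:, k], r_k)
        change = np.sum((s - s_old) ** 2) / max(np.sum(s_old ** 2), 1e-300)
        if change < tol:                               # relative-change stopping rule
            break
    return alpha, s
```

The fitted mean at the observed points is `alpha + s.sum(axis=1)`. Because `s` is updated in place inside the loop over k, each partial residual automatically uses the most recent smooths for the earlier coordinates.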

To understand why this algorithm works, recall the Gauss-Seidel algorithm for 
solving a linear system of the form Az = b for z given a matrix A and a vector b of 
known constants (see Section 2.2.5). The Gauss-Seidel procedure is initialized with 
starting value $z_0$. Then each component of z is solved for in turn, given the current
values for the other components. This process is iterated until convergence. 

Suppose only linear smoothers are used to fit an additive model, and let $S_k$ be
the n x n smoothing matrix for the kth component smoother. Then the backfitting
algorithm solves the set of equations given by $\hat s_k = S_k \left( Y - \sum_{j \ne k} \hat s_j \right)$. Writing this
set of equations in matrix form yields

$$\begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} \hat s_1 \\ \hat s_2 \\ \vdots \\ \hat s_p \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_p Y \end{pmatrix}, \qquad (12.5)$$

which is of the form Az = b where $z = (\hat s_1, \hat s_2, \ldots, \hat s_p)^T = \hat s$. Note that $b = \Lambda Y$,
where $\Lambda$ is a block-diagonal matrix with the individual $S_k$ matrices along the diagonal.
Since the backfitting algorithm sequentially updates each vector $\hat s_k$ as a single block, it
is more formally a block Gauss-Seidel algorithm. The iterative backfitting algorithm
is preferred because it is faster than the direct approach of inverting A.
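
The equivalence between backfitting and a block Gauss-Seidel solution of (12.5) can be checked numerically for p = 2. The sketch below is only an illustration under assumptions: the two smoothers are hypothetical ridge-penalized cubic polynomial smoothers (chosen because their matrices are symmetric with eigenvalues strictly below 1, so the system has a unique solution), the data are simulated, and the response is centered so that the intercept can be ignored.

```python
import numpy as np

def ridge_smoother(x, degree=3, lam=1.0):
    """Smoothing matrix of a ridge-penalized polynomial fit (hypothetical choice).
    It is symmetric with eigenvalues in [0, 1)."""
    z = (x - x.mean()) / x.std()
    B = np.vander(z, degree + 1, increasing=True)[:, 1:]   # z, z^2, ..., z^degree
    B = B - B.mean(axis=0)                                  # remove constants
    return B @ np.linalg.solve(B.T @ B + lam * np.eye(degree), B.T)

rng = np.random.default_rng(1)
n = 40
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)
Y = np.sin(3 * x1) + x2 ** 2 + 0.1 * rng.normal(size=n)
Y = Y - Y.mean()
S1, S2 = ridge_smoother(x1), ridge_smoother(x2)

# Direct solution of A z = b from (12.5) with p = 2
I = np.eye(n)
A = np.block([[I, S1], [S2, I]])
b = np.concatenate([S1 @ Y, S2 @ Y])
direct = np.linalg.solve(A, b)

# Block Gauss-Seidel, which is exactly the backfitting update
s1, s2 = np.zeros(n), np.zeros(n)
for _ in range(200):
    s1 = S1 @ (Y - s2)
    s2 = S2 @ (Y - s1)
gauss_seidel = np.concatenate([s1, s2])

print(np.max(np.abs(gauss_seidel - direct)))
```

Each backfitting sweep costs a few matrix-vector products (or a few smoother applications), whereas the direct approach requires solving an np x np system.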

We now turn to the question of convergence of the backfitting algorithm and the
uniqueness of the solution. Here, it helps to revisit the analogy to multiple regression.
Let D denote the n x p design matrix whose ith row is $x_i^T$, so $D = (x_1, \ldots, x_n)^T$.
Consider solving the multiple regression normal equations $D^T D \beta = D^T Y$ for $\beta$. The
elements of $\beta$ are not uniquely determined if any predictors are linearly dependent,
or equivalently, if the columns of $D^T D$ are linearly dependent. In that case, there
would exist a nonzero vector $\gamma$ such that $D^T D \gamma = 0$. Thus, if $\hat\beta$ were a solution to the normal
equations, $\hat\beta + c\gamma$ would also be a solution for any c.

Analogously, the backfitting estimating equations $A \hat s = \Lambda Y$ will not have a
unique solution if there exists any nonzero $\gamma$ such that $A\gamma = 0$. Let $\mathcal{I}_k$ be the space spanned
by vectors that pass through the kth smoother unchanged. If these spaces are linearly
dependent, then there exist $\gamma_k \in \mathcal{I}_k$, not all zero, such that $\sum_{k=1}^{p} \gamma_k = 0$. In this case, $A\gamma = 0$,
where $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_p)^T$, and therefore there is not a unique solution (see
Problem 12.1).

A more complete discussion of these issues is provided by Hastie and Tibshirani
[322], from which the following result is derived. Let the p smoothers be linear and
each $S_k$ be symmetric with eigenvalues in [0, 1]. Then $A\gamma = 0$ if and only if there
exist linearly dependent $\gamma_k \in \mathcal{I}_k$ that pass through the kth smoother unchanged. In
this case, there are many solutions to $A \hat s = \Lambda Y$, and backfitting converges to one
of them, depending on the starting values. Otherwise, backfitting converges to the
unique solution.

The flexibility of the additive model is further enhanced by allowing the additive
components of the model to be multivariate and by allowing different smoothing
methods for different components. For example, suppose there are seven predictors,
$x_1, \ldots, x_7$, where $x_1$ is a discrete variable with levels 1, ..., c. Then an additive model
to estimate E{Y|x} might be fitted by backfitting:

$$\hat\alpha + \sum_{i=1}^{c-1} \hat\delta_i 1_{\{x_1 = i\}} + \hat s(x_2) + \hat p(x_3) + \hat t(x_4, x_5) + \hat f(x_6, x_7), \qquad (12.6)$$

where the $\hat\delta_i$ permit a separate additive effect for each level of $x_1$, $\hat s(x_2)$ is a spline
smooth over $x_2$, $\hat p(x_3)$ is a cubic polynomial regression on $x_3$, $\hat t(x_4, x_5)$ is a
recursively partitioned regression tree from Section 12.1.4, and $\hat f(x_6, x_7)$ is a bivariate
kernel smooth. Grouping several predictors in this way provides coarser blocks in the
blockwise implementation of the Gauss-Seidel algorithm.


Example 12.1 (Norwegian Paper) We consider some data from a paper plant in
Halden, Norway [9]. The response is a measure of imperfections in the paper, and
there are two predictors. (Our Y, $x_1$, and $x_2$ correspond to 16 - Y5, X1, and X3,
respectively, in the author's original notation.) The left panel of Figure 12.1 shows
the response surface fitted by an ordinary linear model with no interaction. The right
panel shows an additive model fitted to the same data. The estimated $\hat s_k$ are shown
in Figure 12.2. Clearly $x_1$ has a nonlinear effect on the response; in this sense the
additive model is an improvement over the linear regression fit.

FIGURE 12.1 Linear (left) and additive (right) models fitted to the Norwegian paper data in
Example 12.1.


12.1.2 Generalized Additive Models 


Linear regression models can be generalized in several ways. Above, we have replaced 
linear predictors with smooth nonlinear functions. A different way to generalize linear 
regression is in the direction of generalized linear models [446]. 

Suppose that Y|x has a distribution in the exponential family. Let $\mu = E\{Y|x\}$.
A generalized linear model assumes that some function of $\mu$ is a linear function
of the predictors. In other words, the model is $g(\mu) = \alpha + \sum_{k=1}^{p} \beta_k x_k$, where g
is the link function. For example, the identity link $g(\mu) = \mu$ is used to model
a Gaussian distributed response, $g(\mu) = \log \mu$ is used for log-linear models, and
$g(\mu) = \log\{\mu/(1 - \mu)\}$ is one link used to model Bernoulli data.

Generalized additive models (GAMs) extend the additive models of
Section 12.1.1 in a manner analogous to how generalized linear models extend linear
models. For response data in an exponential family, a link function g is chosen, and
the model is

$$g(\mu) = \alpha + \sum_{k=1}^{p} s_k(x_k), \qquad (12.7)$$

where $s_k$ is a smooth function of the kth predictor. The right-hand side of (12.7) is
denoted $\eta$ and is called the additive predictor. GAMs provide the scope and diversity
of generalized linear models, with the additional flexibility of nonlinear smooth effects
in the additive predictor.

For generalized linear models, estimation of $\mu = E\{Y|x\}$ proceeds via iteratively 
reweighted least squares. Roughly, the algorithm proceeds by alternating 
between (i) constructing adjusted response values and corresponding weights, and 
(ii) fitting a weighted linear regression of the adjusted response on the predictors. 
These steps are repeated until the fit has converged. 

Specifically, we described in Section 2.2.1.1 how the iteratively reweighted least 
squares approach for fitting generalized linear models in exponential families is in fact 
the Fisher scoring approach. The Fisher scoring method is ultimately motivated by a 
linearization of the score function that yields an updating equation for estimating the 
parameters. The update is achieved by weighted linear regression. Adjusted responses 
and weights are defined as in (2.41). The updated parameter vector consists of the 
coefficients resulting from a weighted linear least squares regression for the adjusted 
responses. 

FIGURE 12.2 The smooths $\hat{s}_k(x_k)$ fitted with an additive model for the Norwegian paper 
data in Example 12.1. The points are partial residuals as given on the right-hand side of (12.3), 
namely, $\hat{s}_k(x_{ik})$ plus the overall residual from the final smooth. 

For fitting a GAM, weighted linear regression is replaced by weighted smoothing. 
The resulting procedure, called local scoring, is described below. First, let $\mu_i$ be 
the mean response for observation $i$, so $\mu_i = E\{Y_i|x_i\} = g^{-1}(\eta_i)$, where $\eta_i$ is called 
the $i$th value of the additive predictor; and let $V(\mu_i)$ be the variance function, namely, 
$\text{var}\{Y_i|x_i\}$ expressed as a function of $\mu_i$. The algorithm proceeds as follows: 


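1. Initialize the algorithm at $t = 0$. Set $\hat{\alpha}^{(0)} = g(\bar{Y})$ and $\hat{s}_k^{(0)}(\cdot) = 0$ for 
$k = 1, \ldots, p$. This also initializes the additive predictor values 
$\hat{\eta}_i^{(0)} = \hat{\alpha}^{(0)} + \sum_{k=1}^{p} \hat{s}_k^{(0)}(x_{ik})$ and the fitted values 
$\hat{\mu}_i^{(0)} = g^{-1}(\hat{\eta}_i^{(0)})$ corresponding to each observation. 

2. For $i = 1, \ldots, n$, construct adjusted response values 

$$z_i^{(t+1)} = \hat{\eta}_i^{(t)} + \left(Y_i - \hat{\mu}_i^{(t)}\right) \left( \left. \frac{d\mu}{d\eta} \right|_{\eta = \hat{\eta}_i^{(t)}} \right)^{-1}. \tag{12.8}$$

3. For $i = 1, \ldots, n$, construct the corresponding weights 

$$w_i^{(t+1)} = \left( \left. \frac{d\mu}{d\eta} \right|_{\eta = \hat{\eta}_i^{(t)}} \right)^{2} \left( V\left(\hat{\mu}_i^{(t)}\right) \right)^{-1}. \tag{12.9}$$

4. Use a weighted version of the backfitting algorithm from Section 12.1.1 to 
estimate new additive predictors, $\hat{s}_k^{(t+1)}$. In this step, a weighted additive model 
of the form (12.7) is fitted to the adjusted response values $z_i^{(t+1)}$ with weights 
$w_i^{(t+1)}$, yielding $\hat{s}_k^{(t+1)}(x_{ik})$ for $i = 1, \ldots, n$ and $k = 1, \ldots, p$. This step, 
described further below, also allows calculation of new $\hat{\eta}_i^{(t+1)}$ and $\hat{\mu}_i^{(t+1)}$. 

5. Compute a convergence criterion such as 

$$\frac{\sum_{k=1}^{p} \sum_{i=1}^{n} \left( \hat{s}_k^{(t+1)}(x_{ik}) - \hat{s}_k^{(t)}(x_{ik}) \right)^2}{\sum_{k=1}^{p} \sum_{i=1}^{n} \left( \hat{s}_k^{(t)}(x_{ik}) \right)^2} \tag{12.10}$$

and stop when it is small. Otherwise, increment $t$ and go to step 2. 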


To revert to a standard generalized linear model, the only necessary change would be 
to replace the smoothing in step 4 with weighted least squares. 
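
To make the local scoring steps concrete, here is a minimal sketch for a Bernoulli response with the logit link, for which the derivative of the mean with respect to the additive predictor and the variance function both equal mu(1 - mu). Several choices are assumptions made only for illustration and are not the text's method: a Gaussian-kernel smoother stands in for whatever smoother is preferred, the bandwidth `h`, the fixed number of backfitting sweeps, the clipping of the additive predictor, and the simulated data in `simulate_data` are all hypothetical.

```python
import math
import random

def expit(e):
    """Inverse logit, with eta clipped to avoid overflow (an assumed safeguard)."""
    e = max(-30.0, min(30.0, e))
    return 1.0 / (1.0 + math.exp(-e))

def kernel_matrix(x, h):
    """Unweighted Gaussian-kernel smoother weights, one row per target point."""
    return [[math.exp(-0.5 * ((xj - xi) / h) ** 2) for xj in x] for xi in x]

def weighted_smooth(K, r, w):
    """Weight column j of K by w[j], renormalize each row, and smooth r."""
    out = []
    for row in K:
        num = sum(k * wj * rj for k, wj, rj in zip(row, w, r))
        den = sum(k * wj for k, wj in zip(row, w))
        out.append(num / den)
    return out

def weighted_mean(v, w):
    return sum(vi * wi for vi, wi in zip(v, w)) / sum(w)

def local_scoring_logit(X, y, h=0.5, max_iter=20, sweeps=5, tol=1e-6):
    """X is a list of p predictor columns, each of length n; y is 0/1."""
    n, p = len(y), len(X)
    K = [kernel_matrix(col, h) for col in X]
    ybar = sum(y) / n
    alpha = math.log(ybar / (1.0 - ybar))        # step 1: alpha = g(ybar)
    s = [[0.0] * n for _ in range(p)]            # step 1: all smooths zero
    for _ in range(max_iter):
        eta = [alpha + sum(s[k][i] for k in range(p)) for i in range(n)]
        mu = [expit(e) for e in eta]
        d = [max(m * (1.0 - m), 1e-8) for m in mu]   # d mu / d eta = V(mu)
        z = [e + (yi - m) / di for e, yi, m, di in zip(eta, y, mu, d)]  # (12.8)
        w = d                                                          # (12.9)
        alpha = weighted_mean(z, w)              # step 4: weighted backfitting
        s_new = [row[:] for row in s]
        for _ in range(sweeps):
            for k in range(p):
                partial = [z[i] - alpha
                           - sum(s_new[j][i] for j in range(p) if j != k)
                           for i in range(n)]
                fit = weighted_smooth(K[k], partial, w)
                center = weighted_mean(fit, w)
                s_new[k] = [f - center for f in fit]
        num = sum((a - b) ** 2 for k in range(p) for a, b in zip(s_new[k], s[k]))
        den = sum(b ** 2 for k in range(p) for b in s[k])
        s = s_new
        if den > 0.0 and num / den < tol:        # step 5: criterion (12.10)
            break
    eta = [alpha + sum(s[k][i] for k in range(p)) for i in range(n)]
    return alpha, s, [expit(e) for e in eta]

def simulate_data(n=120, seed=1):
    """Hypothetical data: only the first predictor affects the response."""
    rng = random.Random(seed)
    x1 = [rng.uniform(-2, 2) for _ in range(n)]
    x2 = [rng.uniform(-2, 2) for _ in range(n)]
    y = [1 if rng.random() < expit(2.0 * a) else 0 for a in x1]
    return [x1, x2], y
```

Replacing `weighted_smooth` with a weighted least squares line fit in each coordinate would turn the same loop into ordinary iteratively reweighted least squares.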

The fitting of a weighted additive model in step 4 requires weighted smoothing 
methods. For linear smoothers, one way to introduce weights is to multiply the 
elements in the $i$th column of $S$ by $w_i^{(t+1)}$ for each $i$, and then standardize each row 
so it sums to 1. There are other, more natural approaches to weighting some linear 
smoothers (e.g., splines) and nonlinear smoothers. Further details about weighted 
smooths and local scoring are provided in [322, 574]. 
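
The column-scaling recipe for a linear smoother is short enough to write out. This is a sketch under the assumption that the smoothing matrix is stored as a list of rows; the function names and the small example matrix in the note below are illustrative only.

```python
def weighted_smoother_matrix(S, w):
    """Scale column i of S by w[i], then renormalize each row to sum to 1."""
    n = len(S)
    out = []
    for row in S:
        scaled = [row[i] * w[i] for i in range(n)]
        total = sum(scaled)
        if total == 0.0:
            raise ValueError("a row has zero total weight")
        out.append([v / total for v in scaled])
    return out

def smooth(S, y):
    """Apply a linear smoother: fitted values are S times y."""
    return [sum(sij * yj for sij, yj in zip(row, y)) for row in S]
```

With a three-point running-mean matrix and weights (1, 1, 0), the third observation no longer influences any fitted value, while every row of the reweighted matrix still sums to 1.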

As with additive models, the additive predictor in GAMs need not consist solely 
of univariate smooths of the same type. The ideas in Section 12.1.1 regarding more 
general and flexible model building apply here too. 


Example 12.2 (Drug Abuse) The website for this book provides data on 575 pa- 
tients receiving residential treatment for drug abuse [336]. The response variable is 
binary, with Y = 1 for a patient who remained drug-free for one year and Y = 0 oth- 
erwise. We will examine two predictors: number of prior drug treatments (x1) and age 
of patient ($x_2$). A simple generalized additive model is given by $Y_i|x_i \sim \text{Bernoulli}(\pi_i)$ 
with 

$$\log\left\{ \frac{\pi_i}{1 - \pi_i} \right\} = \alpha + \beta_1 s_1(x_{i1}) + \beta_2 s_2(x_{i2}). \tag{12.11}$$

Spline smoothing was used in step 4 of the fitting algorithm. Figure 12.3 shows the 
fitted response surface graphed on the probability scale. Figure 12.4 shows the fitted 
smooths $\hat{s}_k$ on the logit scale. The raw response data are shown by hash marks along 
the bottom ($y_i = 0$) and top ($y_i = 1$) of each panel. 


12.1.3 Other Methods Related to Additive Models 


Generalized additive models are not the only way to extend the additive model. Several 
other methods transform the predictors or response in an effort to provide a more 
effective model for the data. We describe four such approaches below. 


12.1.3.1 Projection Pursuit Regression Additive models produce fits composed 
of $p$ additive surfaces, each of which has a nonlinear profile along one 
coordinate axis while staying constant in orthogonal directions. This aids interpretation 
of the model because each nonlinear smooth reflects the additive effect of one 
predictor. However, it also limits the ability to fit more general surfaces and interaction 
effects that are not additively attributable to a single predictor. Projection pursuit 
regression eliminates this constraint by allowing effects to be smooth functions of 
univariate linear projections of the predictors [209, 380]. 

FIGURE 12.3 The fit of a generalized additive model to the drug abuse data described in 
Example 12.2. The vertical axis corresponds to the predicted probability of remaining drug 
free for one year. 

Specifically, these models take the form 

$$E\{Y|x\} = \alpha + \sum_{k=1}^{M} s_k(a_k^T x), \tag{12.12}$$

where each term $a_k^T x$ is a one-dimensional projection of the predictor vector 
$x = (x_1, \ldots, x_p)^T$. Thus each $s_k$ has a profile determined by $s_k$ along $a_k$ and is constant 
in all orthogonal directions. In the projection pursuit approach, the smooths $s_k$ and 
the projection vectors $a_k$ are estimated for $k = 1, \ldots, M$ to obtain the optimal fit. For 
sufficiently large $M$, the expression in (12.12) can approximate an arbitrary continuous 
function of the predictors [161, 380]. 
To fit such a model, the number of projections, $M$, must be chosen. When 
$M > 1$, the model contains several smooth functions of different linear combinations 
$a_k^T x$. The results may therefore be very difficult to interpret, notwithstanding the 
model’s usefulness for prediction. Choosing M is a model selection problem akin to 
choosing the terms in a multiple regression model, and analogous reasoning should 
apply. One approach would be to fit a model with small M first, then repeatedly add 
the most effective next term and refit. A sequence of models is thus produced, until 
no further additional term substantially improves the fit. 
For a given M, fitting (12.12) can be carried out using the following algorithm: 


1. Begin with $m = 0$ by setting $\hat{\alpha} = \bar{Y}$. 

2. Increment $m$. Define the current working residual for observation $i$ as 

$$r_i^{(m)} = Y_i - \hat{\alpha} - \sum_{k=1}^{m-1} \hat{s}_k(\hat{a}_k^T x_i) \tag{12.13}$$

for $i = 1, \ldots, n$, where the summation vanishes if $m = 1$. These current 
residuals will be used to fit the $m$th projection. 

3. For any $p$-vector $a$ and smoother $s_m$, define the goodness-of-fit measure 

$$Q(a) = 1 - \frac{\sum_{i=1}^{n} \left( r_i^{(m)} - \hat{s}_m(a^T x_i) \right)^2}{\sum_{i=1}^{n} \left( r_i^{(m)} \right)^2}. \tag{12.14}$$

4. For a chosen type of smoother, maximize $Q(a)$ with respect to $a$. This provides 
$\hat{a}_m$ and $\hat{s}_m$. If $m = M$, stop; otherwise go to step 2. 

FIGURE 12.4 The smooth functions $\hat{s}_k$ fitted with a generalized additive model for the 
drug-abuse data in Example 12.2. The raw response data are shown via the hash marks along the 
bottom ($y_i = 0$) and top ($y_i = 1$) of each panel at the locations observed for the corresponding 
predictor variable. 
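
Steps 3 and 4 can be sketched for two predictors as follows. This is an illustration under stated assumptions rather than the procedure used in the text: a Gaussian-kernel smoother with a hypothetical bandwidth `h` replaces the supersmoother, and a crude grid over unit vectors replaces a proper numerical maximization of the goodness-of-fit measure.

```python
import math

def kernel_smooth(u, r, h):
    """Nadaraya-Watson smooth of r on the projected values u (Gaussian kernel)."""
    fit = []
    for u0 in u:
        k = [math.exp(-0.5 * ((ui - u0) / h) ** 2) for ui in u]
        fit.append(sum(ki * ri for ki, ri in zip(k, r)) / sum(k))
    return fit

def goodness_of_fit(a, X, r, h=0.3):
    """Return Q(a) from (12.14) and the fitted smooth along direction a."""
    u = [sum(aj * xj for aj, xj in zip(a, x)) for x in X]
    fit = kernel_smooth(u, r, h)
    rss = sum((ri - fi) ** 2 for ri, fi in zip(r, fit))
    return 1.0 - rss / sum(ri ** 2 for ri in r), fit

def best_direction_2d(X, r, n_angles=36, h=0.3):
    """Grid search over unit vectors a = (cos t, sin t) with 0 <= t < pi."""
    best = None
    for j in range(n_angles):
        t = math.pi * j / n_angles
        a = (math.cos(t), math.sin(t))
        q, _ = goodness_of_fit(a, X, r, h)
        if best is None or q > best[0]:
            best = (q, a)
    return best
```

If the residuals depend on the predictors only through their sum, the measure is much larger along the direction proportional to (1, 1) than along the orthogonal direction, which is the behavior the maximization in step 4 exploits.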


Example 12.3 (Norwegian Paper, Continued) We return to the Norwegian paper 
data of Example 12.1. Figure 12.5 shows the response surface fitted with projection 
pursuit regression for M = 2. A supersmoother (Section 11.4.2) was used for each 
projection. The fitted surface exhibits some interaction between the predictors that is 
not captured by either model shown in Figure 12.1. An additive model was not wholly 
appropriate for these predictors. The heavy lines in Figure 12.5 show the two linear 
directions onto which the bivariate predictor data were projected. The first projection 
direction, labeled $a_1^T x$, is far from being parallel to either coordinate axis. This allows 
a better fit of the interaction between the two predictors. The second projection very 
nearly contributes an additive effect of $x_1$. To further understand the fitted surface, we 
can examine the individual $\hat{s}_k$, which are shown in Figure 12.6. These effects along 
the selected directions provide a more general fit than either the regression or the 
additive model. 

FIGURE 12.5 Projection pursuit regression surface fitted to the Norwegian paper data for 
$M = 2$, as described in Example 12.3. 


In addition to predictor-response smoothing, the ideas of projection pursuit have 
been applied in many other areas, including smoothing for multivariate response data 
[9] and density estimation [205]. Another approach, known as multivariate adaptive re- 
gression splines (MARS), has links to projection pursuit regression, spline smoothing 
(Section 11.2.5), and regression trees (Section 12.1.4.1) [207]. MARS may perform 
very well for some datasets, but recent simulations have shown less promising results 
for high-dimensional data [21]. 


12.1.3.2 Neural Networks Neural networks are a nonlinear modeling method 
for either continuous or discrete responses, producing a regression or classification 
model [50, 51, 323, 540]. For a continuous response Y and predictors x, one type of 
neural network model, called the feed-forward network, can be written as 

$$g(Y) = \beta_0 + \sum_{m=1}^{M} \beta_m f(\alpha_m^T x + \gamma_m), \tag{12.15}$$

where $\beta_0$, $\beta_m$, $\alpha_m$, and $\gamma_m$ for $m = 1, \ldots, M$ are estimated from the data. We can 
think of the $f(\alpha_m^T x + \gamma_m)$ for $m = 1, \ldots, M$ as being analogous to a set of basis 
functions for the predictor space. These $f(\alpha_m^T x + \gamma_m)$, whose values are not directly 
observed, constitute a hidden layer, in neural net terminology. Usually the analyst 
chooses $M$ in advance, but data-driven selection is also possible. In (12.15), the 
form of the activation function, $f$, is usually chosen to be logistic, namely 
$f(z) = 1/[1 + \exp\{-z\}]$. We use $g$ as a link function. Parameters are estimated by minimizing 
the squared error, typically via gradient-based optimization. 
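
The right-hand side of (12.15) is simple to evaluate once the parameters are given. The sketch below assumes plain Python lists for the parameters and shows only the forward computation and the squared-error objective; the function names are illustrative and no fitting routine is included.

```python
import math

def logistic(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def feed_forward(x, beta0, beta, alpha, gamma):
    """Return beta0 + sum_m beta_m f(alpha_m'x + gamma_m) and the hidden values."""
    hidden = [logistic(sum(a * xj for a, xj in zip(alpha_m, x)) + g_m)
              for alpha_m, g_m in zip(alpha, gamma)]
    return beta0 + sum(b * h for b, h in zip(beta, hidden)), hidden

def squared_error(X, gY, params):
    """Sum of squared differences between g(Y_i) and the network output."""
    return sum((t - feed_forward(x, *params)[0]) ** 2 for x, t in zip(X, gY))
```

Here `params` is the tuple `(beta0, beta, alpha, gamma)`, with `alpha` holding one coefficient vector per hidden unit; a gradient-based optimizer would adjust these values to reduce `squared_error`.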

Neural networks are related to projection pursuit regression where $s_k$ in (12.12) 
is replaced by a parametric function $f$ in (12.15), such as the logistic function. Many 
enhancements to the simple neural network model given above are possible, such as 
the inclusion of an additional hidden layer using a different activation function, say 
$h$. This layer is composed of evaluations of $h$ at a number of linear combinations of 
the $f(\alpha_m^T x + \gamma_m)$ for $m = 1, \ldots, M$, roughly serving as a basis for the first hidden 
layer. Neural networks are very popular in some fields, and a large number of software 
packages for fitting these models are available. 

FIGURE 12.6 The smooth functions $\hat{s}_k$ fitted with a projection pursuit regression model for 
the Norwegian paper data. The current residuals, namely the component-fitted smooth plus the 
overall residual, shown as dots, are plotted against each projection $a_k^T x$ for $k = 1, 2$. 


12.1.3.3 Alternating Conditional Expectations The alternating conditional 
expectations (ACE) procedure fits models of the form 

$$E\{g(Y)|x\} = \alpha + \sum_{k=1}^{p} s_k(x_k), \tag{12.16}$$

where $g$ is a smooth function of the response [64]. Unlike most other methods in this 
chapter, ACE treats the predictors as observations of a random variable $X$, and model 
fitting is driven by consideration of the joint distribution of $Y$ and $X$. Specifically, the 
idea of ACE is to estimate $g$ and $s_k$ for $k = 1, \ldots, p$ such that the magnitude of the 
correlation between $g(Y)$ and $\sum_{k=1}^{p} s_k(X_k)$ is maximized subject to the constraint that 
$\text{var}\{g(Y)\} = 1$. The constant $\alpha$ does not affect this correlation, so it can be ignored. 
To fit the ACE model, the following iterative algorithm can be used: 


1. Initialize the algorithm by letting $t = 0$ and $\hat{g}^{(0)}(Y_i) = (Y_i - \bar{Y})/\hat{\sigma}_Y$, where $\hat{\sigma}_Y$ 
is the sample standard deviation of the $Y_i$ values. 

2. Generate updated estimates of the additive predictor functions $\hat{s}_k^{(t+1)}$ for 
$k = 1, \ldots, p$ by fitting an additive model with the $\hat{g}^{(t)}(Y_i)$ values as the response 
and the $X_{ik}$ values as the predictors. The backfitting algorithm from 
Section 12.1.1 can be used to fit this model. 

3. Estimate $\hat{g}^{(t+1)}$ by smoothing the values of $\sum_{k=1}^{p} \hat{s}_k^{(t+1)}(X_{ik})$ (treated as the 
response) over the $Y_i$ (treated as the predictor). 

4. Rescale $\hat{g}^{(t+1)}$ by dividing by the sample standard deviation of the $\hat{g}^{(t+1)}(Y_i)$ 
values. This step is necessary because otherwise setting both $\hat{g}^{(t+1)}$ and 
$\sum_{k=1}^{p} \hat{s}_k^{(t+1)}$ to zero functions trivially provides zero residuals regardless of 
the data. 

5. If $\sum_{i=1}^{n} \left[ \hat{g}^{(t+1)}(Y_i) - \sum_{k=1}^{p} \hat{s}_k^{(t+1)}(X_{ik}) \right]^2$ has converged according to a relative 
convergence criterion, stop. Otherwise, increment $t$ and go to step 2. 


Maximizing the correlation between $\sum_{k=1}^{p} s_k(X_k)$ and $g(Y)$ is equivalent to 
minimizing $E\left\{ \left[ g(Y) - \sum_{k=1}^{p} s_k(X_k) \right]^2 \right\}$ with respect to $g$ and $\{s_k\}$ subject to the 
constraint that $\text{var}\{g(Y)\} = 1$. For $p = 1$, this objective is symmetric in $X$ and $Y$: If 
the two variables are interchanged, the result remains the same, up to a constant. 

ACE provides no fitted model component that directly links E{Y|X} to the 
predictors, which impedes prediction. ACE is therefore quite different from the other 
predictor-response smoothers we have discussed because it abandons the notion of 
estimating the regression function; instead it provides a correlational analysis. Conse- 
quently, ACE can produce surprising results, especially when there is low correlation 
between variables. Such problems, and the convergence properties of the fitting al- 
gorithm, are discussed in [64, 84, 322]. 


12.1.3.4 Additivity and Variance Stabilization Another additive model vari- 
ant relying on transformation of the response is additivity and variance stabilization 
(AVAS) [631]. The model is the same as (12.16) except that $g$ is constrained to be 
strictly monotone with 

$$\text{var}\left\{ g(Y) \,\Big|\, \sum_{k=1}^{p} s_k(x_k) \right\} = C \tag{12.17}$$

for some constant $C$. 
To fit the model, the following iterative algorithm can be used: 


1. Initialize the algorithm by letting $t = 0$ and $\hat{g}^{(0)}(Y_i) = (Y_i - \bar{Y})/\hat{\sigma}_Y$, where $\hat{\sigma}_Y$ 
is the sample standard deviation of the $Y_i$ values. 

2. Initialize the predictor functions by fitting an additive model to the $\hat{g}^{(0)}(Y_i)$ and 
predictor data, yielding $\hat{s}_k^{(0)}$ for $k = 1, \ldots, p$, as done for ACE. 

3. Denote the current mean response function as $\hat{\mu}^{(t)} = \sum_{k=1}^{p} \hat{s}_k^{(t)}(X_k)$. To estimate 
the variance-stabilizing transformation, we must first estimate the conditional 
variance function of $\hat{g}^{(t)}(Y)$ given $\hat{\mu}^{(t)} = u$. This function, $\hat{V}^{(t)}(u)$, is estimated 
by smoothing the current log squared residuals against $u$ and exponentiating 
the result. 

4. Given $\hat{V}^{(t)}(u)$, compute the corresponding variance-stabilizing transformation 
$\psi^{(t)}(z) = \int_0^z \hat{V}^{(t)}(u)^{-1/2} \, du$. This integration can be carried out using a numerical 
technique from Chapter 5. 

5. Update and standardize the response transformation by defining 
$\hat{g}^{(t+1)}(y) = \left[ \psi^{(t)}\left( \hat{g}^{(t)}(y) \right) - \bar{\psi}^{(t)} \right] / \hat{\sigma}_{\psi^{(t)}}$, where $\bar{\psi}^{(t)}$ and $\hat{\sigma}_{\psi^{(t)}}$ denote the sample mean and 
standard deviation of the $\psi^{(t)}\left( \hat{g}^{(t)}(Y_i) \right)$ values. 

6. Update the predictor functions by fitting an additive model to the $\hat{g}^{(t+1)}(Y_i)$ and 
predictor data, yielding $\hat{s}_k^{(t+1)}$ for $k = 1, \ldots, p$, as done for ACE. 

7. If $\sum_{i=1}^{n} \left[ \hat{g}^{(t+1)}(Y_i) - \sum_{k=1}^{p} \hat{s}_k^{(t+1)}(X_{ik}) \right]^2$ has converged according to a relative 
convergence criterion, stop. Otherwise, increment $t$ and go to step 3. 
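
Steps 4 and 5 involve only a one-dimensional integral and a standardization, so they can be sketched briefly. The trapezoid rule below is one simple choice of numerical integration, and the variance function passed in is assumed to be available as a callable; both are illustrative assumptions rather than the text's prescription.

```python
import math

def variance_stabilizer(V, z, n_steps=1000):
    """Approximate psi(z), the integral from 0 to z of V(u)^(-1/2) du,
    using the trapezoid rule with n_steps equal subintervals."""
    h = z / n_steps
    total = 0.5 * (V(0.0) ** -0.5 + V(z) ** -0.5)
    for j in range(1, n_steps):
        total += V(j * h) ** -0.5
    return total * h

def standardize(values):
    """Center by the sample mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]
```

As a check on the integration, the hypothetical variance function V(u) = 1 + u^2 has the closed-form stabilizer asinh(z), which the trapezoid approximation should reproduce closely.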


Unlike ACE, the AVAS procedure is well suited for predictor-response regres- 
sion problems. Further details of this method are given by [322, 631]. 

Both ACE and AVAS can be used to suggest parametric transformations for 
standard multiple regression modeling. In particular, plotting the ACE or AVAS trans- 
formed predictor versus the untransformed predictor can sometimes suggest a simple 
piecewise linear or other transformation for standard regression modeling [157, 332]. 


12.1.4 Tree-Based Methods 


Tree-based methods recursively partition the predictor space into subregions associ- 
ated with increasingly homogeneous values of the response variable. An important 
appeal of such methods is that the fit is often very easy to describe and interpret. For 
reasons discussed shortly, the summary of the fit is called a tree. 

The most familiar tree-based method for statisticians is the classification and 
regression tree (CART) method described by Breiman, Friedman, Olshen, and Stone 
[65]. Both proprietary and open-source software to carry out tree-based modeling 
are widely available [115, 228, 612, 629, 643]. While implementation details vary, 
all of these methods are fundamentally based on the idea of recursive partitioning. 

A tree can be summarized by two sets of information: 


• The answers to a series of binary (yes-no) questions, each of which is based on 
the value of a single predictor 


• A set of values used to predict the response variable on the basis of answers to 
these questions 


An example will clarify the nature of a tree. 


Example 12.4 (Stream Monitoring) Various organisms called macroinvertebrates 
live in the bed of a stream, called the substrate. To monitor stream health, ecologists 
use a measure called the index of biotic integrity (IBI), which quantifies the stream’s 
ability to support and maintain a natural biological community. An IBI allows for 
meaningful measurement of the effect of anthropogenic and other potential stressors 
on streams [363]. In this example, we consider predicting a macroinvertebrate IBI 
from two predictors, human population density and rock size in the substrate. The 
first predictor is the human population density (persons per square kilometer) in the 
stream’s watershed. To improve graphical presentation, the log of population density 
is used in the analysis below, but the same tree would be selected were the untrans- 
formed predictor to be used. The second predictor is the estimated geometric mean 
of the diameter of rocks collected at the sample location in the substrate, measured in 
millimeters and transformed logarithmically. These data, considered further in 
Problem 12.5, were collected by the Environmental Protection Agency as part of a study 
of 353 sites in the Mid-Atlantic Highlands region of the eastern United States from 
1993 to 1998 [185]. 

FIGURE 12.7 Tree fit to predict IBI for Example 12.4. The root node is the top node of the 
tree, the parent nodes are the other nodes indicated by the • symbol, and the terminal nodes 
are $\mathcal{N}_1, \ldots, \mathcal{N}_5$. Follow the left branch from a parent node when the indicated criterion is true, 
and the right branch when it is false. 

Figure 12.7 shows a typical tree. Four binary questions are represented by splits 
in the tree. Each split is based on the value of one of the predictors. The left branch 
of a split is taken when the answer is yes, so that the condition labeling that split is 
met. For example, the top split indicates that the left portion of the tree is for those 
observations with rock size values of 0.4 or less (sand and smaller). Each position in 
the tree where a split is made is a parent node. The topmost parent node is called the 
root node. All parent nodes except the root node are internal nodes. At the bottom of 
the tree the data have been classified into five terminal nodes based on the decisions 
made at the parent nodes. Associated with each terminal node is the mean value of 
the IBI for all observations in that node. We would use this value as the prediction for 
any observation whose predictors lead to this node. For example, we predict IBI = 20 
for any observation that would be classified in M4. 
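
A tree of this kind can be stored as nested yes/no questions with a prediction at each terminal node. The sketch below is purely illustrative: only the root question (rock size of 0.4 or less) is taken from the text, while the second split, its threshold, and all three predicted values are invented placeholders and do not reproduce Figure 12.7.

```python
# Hypothetical tree: only the root split (rock size <= 0.4) comes from the text;
# every other threshold and prediction below is invented for illustration.
TREE = {
    "feature": "rock_size", "threshold": 0.4,
    "yes": {"feature": "log_pop_density", "threshold": 2.0,
            "yes": {"predict": 40.0},
            "no": {"predict": 20.0}},
    "no": {"predict": 55.0},
}

def predict(node, obs):
    """Follow yes (left) or no (right) branches until a terminal node is reached."""
    while "predict" not in node:
        branch = "yes" if obs[node["feature"]] <= node["threshold"] else "no"
        node = node[branch]
    return node["predict"]
```

An observation is a dictionary of predictor values, so `predict(TREE, {"rock_size": 0.1, "log_pop_density": 3.0})` answers yes at the root, no at the second question, and returns the value stored at that terminal node.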


12.1.4.1 Recursive Partitioning Regression Trees Suppose initially that the 
response variable is continuous. Then tree-based smoothing is often called recursive 
partitioning regression. Section 12.1.4.3 discusses prediction of categorical 
responses. 

Consider predictor-response data where $x_i$ is a vector of $p$ predictors associated 
with a response $Y_i$, for $i = 1, \ldots, n$. For simplicity, assume that all $p$ predictors are 
continuous. Let $q$ denote the number of terminal nodes in the tree to be fitted. 

Tree-based predictions are piecewise constant. If the predictor values for the $i$th 
observation place it in the $j$th terminal node, then the $i$th predicted response equals a 
constant, $\hat{a}_j$. Thus, the tree-based smooth is 

$$\hat{s}(x_i) = \sum_{j=1}^{q} \hat{a}_j 1_{\{x_i \in \mathcal{N}_j\}}. \tag{12.18}$$

This model is fitted using a partitioning process that adaptively partitions predictor 
space into hyperrectangles, each corresponding to one terminal node. Once the 
partitioning is complete, $\hat{a}_j$ is set equal to, say, the mean response value of cases falling 
in the $j$th terminal node. 

Notice that this framework implies that there are a large number of possible 
trees whenever n and/or p are not trivially small. Any terminal node might be split 
to create a larger tree. The two branches from any parent node might be collapsed to 
convert the parent node to a terminal node, forming a subtree of the original tree. Any 
branch itself might be replaced by one based on a different predictor variable and/or 
criterion. The partitioning process used to fit a tree is described next. 

In the simplest case, suppose $q = 2$. Then we seek to split $\mathbb{R}^p$ into two 
hyperrectangles using one axis-parallel boundary. The choice can be characterized by a 
split coordinate, $c \in \{1, \ldots, p\}$, and a split point or threshold, $t \in \mathbb{R}$. The two terminal 
nodes are then $\mathcal{N}_1 = \{x_i : x_{ic} < t\}$ and $\mathcal{N}_2 = \{x_i : x_{ic} \geq t\}$. Denote the sets of 
indices of the observations falling in the two nodes as $S_1$ and $S_2$, respectively. Using 
node-specific sample averages yields the fit 

$$\hat{s}(x_i) = 1_{\{i \in S_1\}} \sum_{j \in S_1} \frac{Y_j}{n_1} + 1_{\{i \in S_2\}} \sum_{j \in S_2} \frac{Y_j}{n_2}, \tag{12.19}$$

where $n_j$ is the number of observations falling in the $j$th terminal node. 

For continuous predictors and ordered discrete predictors, defining a split in this 
manner is straightforward. Treatment of an unordered categorical variable is different. 
Suppose each observation of this variable may take on one of several categories. 
The set of all such categories must be partitioned into two subsets. Fortunately, we 
may avoid considering all possible partitions. First, order the categories in order 
of the average response within each category. Then, treat these ordered categories 
as if they were observations of an ordered discrete predictor. This strategy permits 
optimal splits [65]. There are also natural ways to deal with observations having some 
missing predictor values. Finally, selecting transformations of the predictors is usually 
not a problem: Tree-based models are invariant to monotone transformations of the 
predictors because, in most software packages, the split point is determined in terms 
of the ranks of the predictor values. 
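
The ordering trick for an unordered categorical predictor can be sketched in a few lines. The function names are illustrative assumptions; the point is that K categories yield only K - 1 candidate partitions after ordering, rather than the 2^(K-1) - 1 possible ways of dividing them into two nonempty subsets.

```python
def order_categories(categories, y):
    """Return the distinct categories sorted by their mean response."""
    sums, counts = {}, {}
    for c, yi in zip(categories, y):
        sums[c] = sums.get(c, 0.0) + yi
        counts[c] = counts.get(c, 0) + 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])

def candidate_partitions(ordered):
    """The len(ordered) - 1 splits that respect the mean-response ordering."""
    return [(ordered[:j], ordered[j:]) for j in range(1, len(ordered))]
```

Each candidate pair returned by `candidate_partitions` can then be evaluated exactly like a threshold on an ordered discrete predictor.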

To find the best tree with $q = 2$ terminal nodes, we seek to minimize the residual 
squared error, 

$$\text{RSS}(c, t) = \sum_{j=1}^{q} \sum_{i \in S_j} \left( Y_i - \hat{a}_j \right)^2 \tag{12.20}$$

with respect to $c$ and $t$, where $\hat{a}_j = \sum_{i \in S_j} Y_i / n_j$. Note that the $S_j$ are defined using 
the values of $c$ and $t$, and that $\text{RSS}(c, t)$ changes only when memberships change in 
the sets $S_j$. Minimizing (12.20) is therefore a combinatorial optimization problem. 
For each coordinate, we need to try at most $n - 1$ splits, and fewer if there are tied 
predictor values in the coordinate. Therefore, the minimal $\text{RSS}(c, t)$ can be found by 
searching at most $p(n - 1)$ trees. Exhaustive search to find the best tree is feasible 
when $q = 2$. 

FIGURE 12.8 Partitioning of predictor space (rock size and population density variables) 
for predicting IBI as discussed in Examples 12.4 and 12.5. 
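
The exhaustive search for the best single split can be sketched as follows. Placing each candidate threshold at the midpoint between consecutive distinct predictor values is an assumption made here for concreteness; any threshold between the same two values produces the same partition and hence the same residual squared error.

```python
def rss(values):
    """Residual sum of squares about the mean of values."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(X, y):
    """Search all coordinates c and thresholds t; return (rss, c, t)."""
    p = len(X[0])
    best = None
    for c in range(p):
        levels = sorted(set(x[c] for x in X))
        # Midpoints between consecutive distinct values: at most n - 1 per coordinate.
        for lo, hi in zip(levels, levels[1:]):
            t = 0.5 * (lo + hi)
            left = [yi for x, yi in zip(X, y) if x[c] < t]
            right = [yi for x, yi in zip(X, y) if x[c] >= t]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, c, t)
    return best
```

Coordinates are indexed from zero in this sketch, so a returned `c` of 0 refers to the first predictor.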


Now suppose $q = 3$. A first split coordinate and split point partition $\mathbb{R}^p$ into two 
hyperrectangles. One of these hyperrectangles is then partitioned into two portions 
using a second split coordinate and split point, applied only within this hyperrectangle. 
The result is three terminal nodes. There are at most $p(n - 1)$ choices for 
the first split. For making the second split on any coordinate different from the one 
used for the first split, there are at most $p(n - 1)$ choices for each possible first split 
chosen. For a second split on the same coordinate as the first split, there are at most 
$p(n - 2)$ choices. Carrying this logic on for larger $q$, we see that there are about 
$(n - 1)(n - 2) \cdots (n - q + 1)\, p^{q-1}$ trees to be searched. This enormous number 
defies exhaustive search. 
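
The size of this search space is easy to evaluate numerically. The helper below simply computes the rough count stated above; its name is an illustrative assumption.

```python
def approx_tree_count(n, p, q):
    """(n - 1)(n - 2)...(n - q + 1) * p**(q - 1), the rough count from the text."""
    count = p ** (q - 1)
    for j in range(1, q):
        count *= n - j
    return count
```

With the dimensions of the stream monitoring data (n = 353 sites and p = 2 predictors), a tree with q = 5 terminal nodes already corresponds to roughly 2.4 × 10^11 candidates, compared with 704 for q = 2.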

Instead, a greedy search algorithm is applied (see Section 3.2). Each split is 
treated sequentially. The best single split is chosen to split the root node. For each 
child node, a separate split is chosen to split it optimally. Note that the q terminal 
nodes obtained in this way will usually not minimize the residual squared error over 
the set of all possible trees having q terminal nodes. 
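
A greedy search can be sketched as below. One detail is an assumption of this sketch rather than something the text specifies: to stop at exactly q terminal nodes, the node split at each stage is the one whose best split gives the largest reduction in residual squared error. Candidate thresholds are again taken at midpoints between distinct predictor values.

```python
def node_rss(idx, y):
    """Residual sum of squares of the responses indexed by idx."""
    m = sum(y[i] for i in idx) / len(idx)
    return sum((y[i] - m) ** 2 for i in idx)

def best_split(idx, X, y):
    """Best axis-parallel split of node idx: (gain, c, t, left, right) or None."""
    best = None
    base = node_rss(idx, y)
    for c in range(len(X[0])):
        levels = sorted(set(X[i][c] for i in idx))
        for lo, hi in zip(levels, levels[1:]):
            t = 0.5 * (lo + hi)
            left = [i for i in idx if X[i][c] < t]
            right = [i for i in idx if X[i][c] >= t]
            gain = base - node_rss(left, y) - node_rss(right, y)
            if best is None or gain > best[0]:
                best = (gain, c, t, left, right)
    return best

def grow_greedy(X, y, q):
    """Split one terminal node at a time until there are q terminal nodes."""
    nodes = [list(range(len(y)))]
    while len(nodes) < q:
        options = [(best_split(idx, X, y), j) for j, idx in enumerate(nodes)]
        options = [(s, j) for s, j in options if s is not None]
        if not options:
            break  # no node can be split further
        split, j = max(options, key=lambda o: o[0][0])
        nodes[j:j + 1] = [split[3], split[4]]
    return nodes
```

Each element of the returned list holds the observation indices of one terminal node. Because every split is chosen conditionally on the earlier ones, the result need not minimize the residual squared error over all trees with q terminal nodes.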


Example 12.5 (Stream Monitoring, Continued) To understand how terminal 
nodes in a tree correspond to hyperrectangles in predictor space, recall the stream 
monitoring data introduced in Example 12.4. Another representation of the tree in 
Figure 12.7 is given in Figure 12.8. This plot shows the partitioning of the predictor 
space determined by values of the rock size and population density variables. Each 
circle is centered at an $x_i$ observation ($i = 1, \ldots, n$). The area of each circle reflects 
the magnitude of the IBI value for that observation, with larger circles corresponding 
to larger IBI values. The rectangular regions labeled $\mathcal{N}_1, \ldots, \mathcal{N}_5$ in this graph 
correspond to the terminal nodes shown in Figure 12.7. The first split (on the rock size 
coordinate at the threshold $t = 0.4$) is shown by the vertical line in the middle of the 
plot. Subsequent splits partition only portions of the predictor space. For example, the 
region corresponding to rock size exceeding 0.4 is next split into two nodes, $\mathcal{N}_4$ and 
$\mathcal{N}_5$, based on the value of the population density variable. Note that sequential 
splitting has drawbacks: An apparently natural division of the data based on whether the 
population density exceeds about 2.5 is represented by two slightly mismatched splits 
because a previous split occurred at 0.4 on the rock size variable. The uncertainty of 
the tree structure is discussed further in Section 12.1.4.4. 

FIGURE 12.9 Piecewise constant tree model predictions for IBI as discussed in 
Example 12.5. 

The piecewise constant model for this fitted tree is shown in Figure 12.9, with 
the IBI on the vertical axis. To best display the surface, the axes have been reversed 
compared to Figure 12.8. 


12.1.4.2 Tree Pruning For a given q, greedy search can be used to fit a tree 
model. Note that q is, in essence, a smoothing parameter. Large values of q retain 
high fidelity to the observed data but provide trees with high potential variation in 
predictions. Such an elaborate model may also sacrifice interpretability. Low values 
of q provide less predictive variability because there are only a few terminal nodes, 
but predictive bias may be introduced if responses are not homogeneous within each 
terminal node. We now discuss how to choose q. 

A naive approach for choosing $q$ is to continue splitting terminal nodes until 
no additional split gives a sufficient reduction in the overall residual sum of squares. 
This approach may miss important structure in the data because subsequent splits 
may be quite valuable even if the current split offers little or no improvement. For 
example, consider the saddle-shaped response surface obtained when X; and X2 are 
independent predictors distributed uniformly over [—1, 1] and Y = X, X2. Then no 
single split on either predictor variable will be of much benefit, but any first split 
enables two subsequent splits that will greatly reduce the residual sum of squares. 

A more effective strategy for choosing $q$ begins by growing the tree, splitting 
each terminal node until it has no more than some prespecified minimal number of 
observations in it or its residual squared error does not exceed some prespecified 
percentage of the squared error for the root node. The number of terminal nodes in 
this full tree may greatly exceed $q$. Then, terminal nodes are sequentially recombined 
from the bottom up in a way that does not greatly inflate the residual sum of squares. 
One implementation of this strategy is called cost-complexity pruning [65, 540]. The 
final tree is a subtree of the full tree, selected according to a criterion that balances a 
penalty for prediction error and a penalty for tree complexity. 

Let the full tree be denoted by $T_0$, and let $T$ denote some subtree of $T_0$ that can 
be obtained by pruning everything below some parent nodes in $T_0$. Let $q(T)$ denote 
the number of terminal nodes in the tree $T$. The cost-complexity criterion is given by 

$$R_\alpha(T) = r(T) + \alpha q(T), \tag{12.21}$$

where $r(T)$ is the residual sum of squares or some other measure of prediction error 
for tree $T$, and $\alpha$ is a user-supplied parameter that penalizes tree complexity. For a 
given $\alpha$, the optimal tree is the subtree of $T_0$ that minimizes $R_\alpha(T)$. When $\alpha = 0$, the 
full tree, $T_0$, will be selected as optimal. When $\alpha = \infty$, the tree with only the root 
node will be selected. If $T_0$ has $q(T_0)$ terminal nodes, then there are at most $q(T_0)$ 
subtrees that can be obtained by choosing different values of $\alpha$. 
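
The criterion (12.21) can be evaluated over a small set of candidate subtrees to see how the penalty shifts the choice. In this sketch each candidate is summarized by a label, its prediction error r(T), and its number of terminal nodes q(T); the representation and any numbers supplied to it are hypothetical.

```python
def cost_complexity(r, q, alpha):
    """R_alpha(T) = r(T) + alpha * q(T), as in (12.21)."""
    return r + alpha * q

def select_subtree(subtrees, alpha):
    """subtrees: list of (label, r, q) tuples; return the minimizer of R_alpha."""
    return min(subtrees, key=lambda s: cost_complexity(s[1], s[2], alpha))
```

For made-up candidates `[("root", 100.0, 1), ("mid", 40.0, 3), ("full", 30.0, 6)]`, a penalty of 0 selects the full tree, a penalty of 5 selects the intermediate tree, and a penalty of 100 selects the root-only tree, matching the limiting behavior described above.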

The best approach for selecting the value for the parameter $\alpha$ in (12.21) relies 
on cross-validation. The dataset is partitioned into $V$ separate portions of equal size, 
where $V$ is typically between 3 and 10. For a finite sequence of $\alpha$ values, the algorithm 
proceeds as follows: 


1. Remove one of the $V$ parts of the dataset. This subset is called the validation 
set. 


2. Find the optimal subtree for each value of $\alpha$ in the sequence using the remaining 
$V - 1$ parts of the data. 


3. For each optimal subtree, predict the validation-set responses, and compute the 
cross-validated sum of squared error based on these validation-set predictions. 


Repeat this process for all $V$ parts of the data. For each $\alpha$, compute the total 
cross-validated sum of squares over all $V$ data partitions. The value of $\alpha$ that minimizes the 
cross-validated sum of squares is selected; call it $\hat{\alpha}$. Having estimated the best value 
for the complexity parameter, we may now prune the full tree for all the data back to 
the subtree determined by $\hat{\alpha}$. 

Efficient algorithms for finding the optimal tree for a sequence of $\alpha$ values (see 
step 2 above) are available [65, 540]. Indeed, the set of optimal trees for a sequence 
of $\alpha$ values is nested, with smaller trees corresponding to larger values of $\alpha$, and 
all members in the sequence can be visited by sequential recombination of terminal 
nodes from the bottom up. Various enhancements of this cross-validation strategy 
have been proposed, including a variant of the above approach that chooses the 
simplest tree among those trees that nearly achieve the minimum cross-validated sum of 
squares [629]. 


Example 12.6 (Stream Monitoring, Continued) Let us return to the stream
ecology example introduced in Example 12.4. A full tree for these data was obtained
by splitting until every terminal node had fewer than 10 observations in it or had
residual squared error less than 1% of the residual squared error for the root node. This
process produced a full tree with 53 terminal nodes. Figure 12.10 shows the total
cross-validated residual squared error as a function of the number of terminal nodes.
This plot was produced using 10-fold cross-validation (V = 10). The full tree can be
pruned from the bottom up, recombining the least beneficial terminal nodes, until the
minimal value of R(T) is reached. Note that the correspondence between values of
$\alpha$ and tree sizes means that we need only consider a limited collection of $\alpha$ values,
and it is therefore more straightforward to plot R(T) against q(T) instead of plotting
against $\alpha$. The minimal cross-validated sum of squares is achieved for a tree with five
terminal nodes; indeed, this is the tree shown in Figure 12.7.

FIGURE 12.10 Cross-validation residual sum of squares versus node size for Example 12.6.
The top horizontal axis shows the cost-complexity parameter $\alpha$.

For this example, the selection of the optimal $\alpha$, and thus the final tree, varies
with different random partitions of the data. The optimal tree typically has between 
3 and 13 terminal nodes. This uncertainty emphasizes the potential structural in- 
stability of tree-based models, particularly for datasets where the signal is not 
strong. 


12.1.4.3 Classification Trees Digressing briefly from this chapter’s focus 
on smoothing, it is worthwhile here to quickly summarize tree-based methods for 
categorical response variables. 

Recursive partitioning models for predicting a categorical response variable
are typically called classification trees [65, 540]. Let each response observation $Y_i$
take on one of M categories. Let $\hat{p}_{jm}$ denote the proportion of observations in the
terminal node $\mathcal{N}_j$ that are of class m (for m = 1, ..., M). Loosely speaking, all the
observations in $\mathcal{N}_j$ are predicted to equal the class that makes up the majority in
that node. Such prediction by majority vote within terminal nodes can be modified in
two ways. First, votes can be weighted to reflect prior information about the overall
prevalence of each class. This permits predictions to be biased toward predominant
classes. Second, votes can be weighted to reflect different losses for different types
of misclassification [629]. For example, if classes correspond to medical diagnoses,
some false positive or false negative diagnoses may be grave errors, while other
mistakes may have only minor consequences.

Construction of a classification tree relies on partitioning the predictor space 
using a greedy strategy similar to the one used for recursive partitioning regression. 
For a regression tree split, the split coordinate c and split point t are selected by 
minimizing the total residual sum of squares within the left and right children nodes. 
For classification trees, a different measure of error is required. The residual squared 
error is replaced by a measure of node impurity. 

There are various approaches to measuring node impurity, but most are based on
the following principle. The impurity of node j should be small when observations
in that node are concentrated within one class, and the impurity should be large
when they are distributed uniformly over all M classes. Two popular measures of
impurity are the entropy, given for node j by $-\sum_{m=1}^{M} \hat{p}_{jm} \log \hat{p}_{jm}$, and the Gini index,
given by $\sum_{l \ne m} \hat{p}_{jl} \hat{p}_{jm}$. These approaches are more effective than simply counting
misclassifications because a split may drastically improve the purity in a node without
changing any classifications. This occurs, for example, when majority votes on both
sides of a split have the same outcome as the unsplit vote, but the margin of victory
in one of the subregions is much narrower than in the other.
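
A small numerical sketch makes this last point concrete. The class counts are hypothetical, and combining the two child nodes by a size-weighted average of their impurities is an assumption made here for illustration; the entropy and Gini formulas are the ones given above.

```python
import math


def proportions(counts):
    total = sum(counts)
    return [c / total for c in counts]


def entropy(counts):
    """Entropy impurity: -sum_m p_m log p_m, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in proportions(counts) if p > 0)


def gini(counts):
    """Gini index: sum over l != m of p_l p_m, which equals 1 - sum_m p_m^2."""
    return 1.0 - sum(p * p for p in proportions(counts))


def misclassified(counts):
    """Number of observations not in the majority class."""
    return sum(counts) - max(counts)


def split_impurity(measure, left, right):
    """Size-weighted average impurity of two child nodes (an assumption)."""
    n_left, n_right = sum(left), sum(right)
    return (n_left * measure(left) + n_right * measure(right)) / (n_left + n_right)


parent = [70, 30]                 # hypothetical class counts in a node
left, right = [40, 5], [30, 25]   # children; the first class wins both votes

print("misclassified:", misclassified(parent),
      misclassified(left) + misclassified(right))
print("Gini:   ", gini(parent), split_impurity(gini, left, right))
print("entropy:", entropy(parent), split_impurity(entropy, left, right))
```

In this example the split leaves the number of misclassified observations unchanged, yet both impurity measures decrease, because the left child is far purer than the parent.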

Cost-complexity tree pruning can proceed using the same strategies described in
Section 12.1.4.2. The entropy or the Gini index can be used for the cost measure r(T) in 
(12.21). Alternatively, one may let r(T) equal a (possibly weighted) misclassification 
rate to guide pruning. 


12.1.4.4 Other Issues for Tree-Based Methods Tree-based methods offer 
several advantages over other, more traditional modeling approaches. First, tree mod- 
els fit interactions between predictors and other nonadditive behavior without re- 
quiring formal specification of the form of the interaction by the user. Second, there 
are natural ways to use data with some missing predictor values, both when fit- 
ting the model and when using it to make predictions. Some strategies are surveyed 
in [64, 540]. 

One disadvantage is that trees can be unstable. Therefore, care must be taken 
not to overinterpret particular splits. For example, if the two smallest IBI values in 
AN, in Figure 12.8 are increased somewhat, then this node is omitted when a new 
tree is constructed on the revised data. New data can often cause quite different splits 
to be chosen even if predictions remain relatively unchanged. For example, from 
Figure 12.8 it is easy to imagine that slightly different data could have led to the root 
node splitting on population density at the split point 2.5, rather than on rock size at 
the point 0.4. Trees can also be unstable in that building the full tree to a different 
size before pruning can cause a different optimal tree to be chosen after pruning. 

Another concern is that assessment of uncertainty can be somewhat challeng- 
ing. There is no simple way to summarize a confidence region for the tree structure 
itself. Confidence intervals for tree predictions can be obtained using the bootstrap 
(Chapter 9). 

Tree-based methods are popular in computer science, particularly for
classification [522, 540]. Bayesian alternatives to tree methods have also been proposed
[112, 151]. Medical applications of tree-based methods are particularly popular,
perhaps because the binary decision tree is simple to explain and to apply as a tool in
disease diagnosis [65, 114].

Finally, we consider high-dimensional data that lie near a low-dimensional manifold
such as a curve or surface. For such data, there may be no clear conceptual separation 
of the variables into predictors and responses. Nevertheless, we may be interested in 
estimating smooth relationships among variables. In this section we describe one ap- 
proach, called principal curves, for smoothing multivariate data. Alternative methods 
for discovering relationships among variables, such as association rules and cluster 
analysis, are discussed in [323]. 


12.2.1 Principal Curves 


A principal curve is a special type of one-dimensional nonparametric summary of a p- 
dimensional general multivariate dataset. Loosely speaking, each point on a principal 
curve is the average of all data that project onto that point on the curve. We began 
motivating principal curves in Section 11.6. The data in Figure 11.18 were not suitable 
for predictor-response smoothing, yet adapting the concept of smoothing to general 
multivariate data allowed the very good fit shown in the right panel of Figure 11.18. 
We now describe more precisely the notion of a principal curve and its estimation 
[321]. Related software includes [319, 367, 644]. 


12.2.1.1 Definition and Motivation General multivariate data may lie near a
connected, one-dimensional curve snaking through $\Re^p$. It is this curve we want to
estimate. We adopt a time-speed parameterization of curves below to accommodate
the most general case.

We can write a one-dimensional curve in $\Re^p$ as $f(\tau) = (f_1(\tau), \ldots, f_p(\tau))$ for $\tau$
between $\tau_0$ and $\tau_1$. Here, $\tau$ can be used to indicate distance along the one-dimensional
curve in p-dimensional space. The arc length of a curve f is $\int_{\tau_0}^{\tau_1} \|f'(\tau)\| \, d\tau$, where

$$\|f'(\tau)\| = \sqrt{\left(\frac{df_1(\tau)}{d\tau}\right)^2 + \cdots + \left(\frac{df_p(\tau)}{d\tau}\right)^2}.$$

If $\|f'(\tau)\| = 1$ for all $\tau \in [\tau_0, \tau_1]$, then the arc length between any two points $\tau_a$
and $\tau_b$ along the curve is $|\tau_a - \tau_b|$. In this case, f is said to have the unit-speed
parameterization. It is often helpful to imagine a bug walking forward along the
curve at a speed of 1 or backward at a speed of $-1$ (the designation of forward
and backward is arbitrary). Then the amount of time it takes the bug to walk between
two points corresponds to the arc length, and the positive or negative sign corresponds
to the direction taken. Any smooth curve with $\|f'(\tau)\| > 0$ for all $\tau \in [\tau_0, \tau_1]$ can be
reparameterized to unit speed. If the coordinate functions of a unit-speed curve are
smooth, then f itself is smooth.
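
As a small worked illustration of reparameterizing to unit speed, consider a semicircle of radius $r > 0$ traced by the angle $s$. The curve $g$ and the symbols $r$, $s$, and $u$ are introduced here only for this example.

```latex
% Semicircle of radius r, parameterized by the angle s in [0, \pi]:
g(s) = (r\cos s,\; r\sin s), \qquad
\|g'(s)\| = \sqrt{r^2\sin^2 s + r^2\cos^2 s} = r .

% Arc length accumulated from 0 to s:
\tau(s) = \int_0^s \|g'(u)\|\,du = r s
\quad\Longrightarrow\quad s = \tau / r .

% Unit-speed version, for \tau in [0, \pi r]:
f(\tau) = g(\tau/r) = \bigl(r\cos(\tau/r),\; r\sin(\tau/r)\bigr), \qquad
\|f'(\tau)\| = \sqrt{\sin^2(\tau/r) + \cos^2(\tau/r)} = 1 .
```

Here the bug walking at speed 1 needs time $\pi r$ to traverse the whole curve, which is exactly its arc length.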
FIGURE 12.11 Two panels illustrating the definition of a principal curve and its estimation.
In the left panel, the curve f is intersected by an axis that is orthogonal to f at a particular $\tau^*$.
A conditional density curve is sketched over this axis; if f is a principal curve, then the mean
of this conditional density must equal $f(\tau^*)$. In the right panel, a neighborhood around $\tau^*$ is
sketched. Within these boundaries, all points project onto f near $\tau^*$. The sample mean of these
points should be a good approximation to the true mean of the conditional density in the left
panel.


The types of curves we are interested in estimating are smooth, nonintersecting
curves that aren't too wiggly. Specifically, let us assume that f is a smooth unit-speed
curve in $\Re^p$ parameterized over the closed interval $[\tau_0, \tau_1]$ such that $f(\tau) \ne f(r)$ when
$r \ne \tau$ for all $r, \tau \in [\tau_0, \tau_1]$, and f has finite length inside any closed ball in $\Re^p$.

For any point $x \in \Re^p$, define the projection index function $\tau_f(x) : \Re^p \to \Re^1$
according to

$$\tau_f(x) = \sup_{\tau} \left\{ \tau : \|x - f(\tau)\| = \inf_{r} \|x - f(r)\| \right\}. \tag{12.22}$$

Thus $\tau_f(x)$ is the largest value of $\tau$ for which $f(\tau)$ is closest to x. Points with
similar projection indices project orthogonally onto a small portion of the curve f. The
projection index will later be used to define neighborhoods.
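
For a polygonal (piecewise linear) curve, the projection index can be computed segment by segment. The following is a minimal sketch under that assumption; the function name and the example curve are hypothetical, and the curve is taken to be parameterized by arc length measured from its first vertex.

```python
import math


def projection_index(x, vertices):
    """Projection index of point x onto a polygonal curve.

    vertices: list of points (tuples) joined by straight segments, with the
    curve parameterized by arc length from the first vertex (unit speed).
    Returns (tau, distance), where tau is the largest arc-length position
    whose curve point is closest to x, mirroring the sup in (12.22).
    """
    best_tau, best_dist = 0.0, math.inf
    start = 0.0                               # arc length at segment start
    for a, b in zip(vertices, vertices[1:]):
        d = [bj - aj for aj, bj in zip(a, b)]
        seg_len = math.sqrt(sum(dj * dj for dj in d))
        if seg_len == 0.0:
            continue                          # skip repeated vertices
        # Fraction along the segment of the orthogonal projection, clamped
        # to [0, 1] so the closest point stays on the segment.
        u = sum((xj - aj) * dj for xj, aj, dj in zip(x, a, d)) / seg_len ** 2
        u = min(1.0, max(0.0, u))
        closest = [aj + u * dj for aj, dj in zip(a, d)]
        dist = math.dist(x, closest)
        if dist <= best_dist:                 # "<=" keeps the largest tau on exact ties
            best_tau, best_dist = start + u * seg_len, dist
        start += seg_len
    return best_tau, best_dist


# A curve shaped like a square letter C, walked from top right to bottom right.
c_curve = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0)]

print(projection_index((0.2, 1.3), c_curve))   # projects onto the top edge
print(projection_index((0.5, 0.5), c_curve))   # equidistant from all three edges
```

The second point is equally close to all three edges, so the tie is resolved in favor of the largest arc-length position, as the supremum in the definition requires.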

Suppose that X is a random vector in $\Re^p$, having a probability density with
finite second moment. Unlike in previous sections, we cannot distinguish variables
as predictors and response.

We define f to be a principal curve if $f(\tau^*) = E\{X \mid \tau_f(X) = \tau^*\}$ for all $\tau^* \in
[\tau_0, \tau_1]$. This requirement is sometimes termed self-consistency. Figure 12.11
illustrates this notion that the distribution of points orthogonal to the curve at some $\tau$ must
have mean equal to the curve itself at that point. In the left panel, a distribution is
sketched along an axis that is orthogonal to f at one $\tau^*$. The mean of this density is
$f(\tau^*)$. Note that for ellipsoid distributions, the principal component lines are principal
curves. Principal components are reviewed in [471].

Principal curves are motivated by the concept of local averaging: The principal
curve connects the means of points in local neighborhoods. For predictor-response
smooths, neighborhoods are defined along the predictor coordinate axes. For principal
curves, neighborhoods are defined along the curve itself. Points that project nearby
on the curve are in the same neighborhood. The right panel of Figure 12.11 illustrates
the notion of a local neighborhood along the curve.


12.2.1.2 Estimation An iterative algorithm can be used to estimate a principal
curve from a sample of p-dimensional data, $X_1, \ldots, X_n$. The algorithm is initialized at
iteration t = 0 by choosing a simple starting curve $f^{(0)}(\tau)$ and setting $\tau^{(0)}(x) = \tau_{f^{(0)}}(x)$
from (12.22). One reasonable choice would be to set $f^{(0)}(\tau) = \bar{X} + a\tau$, where a is the
first linear principal component estimated from the data. The algorithm proceeds as
follows:


1. Smooth the kth coordinate of the data. Specifically, for k = 1, ..., p, smooth
$X_{ik}$ against $\tau^{(t)}(X_i)$ using a standard bivariate predictor-response smoother with
span $h^{(t)}$. The projection of the points $X_i$ onto $f^{(t)}$ for i = 1, ..., n provides
the predictors $\tau^{(t)}(X_i)$. The responses are the $X_{ik}$. The result is $f^{(t+1)}$, which
serves as an estimate of $E\{X \mid \tau_{f^{(t)}}(X)\}$. This implements the scatterplot smoothing
strategy of locally averaging the collection of points that nearly project onto
the same point on the principal curve.


2. Interpolate between the $f^{(t+1)}(X_i)$ for i = 1, ..., n, and compute $\tau_{f^{(t+1)}}(X_i)$ as
the distances along $f^{(t+1)}$. Note that some $X_i$ may project onto a quite different
segment than they did at the previous iteration.


3. Let $\tau^{(t+1)}(X)$ equal $\tau_{f^{(t+1)}}(X)$ transformed to unit speed. This amounts to
rescaling the $\tau_{f^{(t+1)}}(X_i)$ so that each equals the total distance traveled along the
polygonal curve to reach it.


4. Evaluate the convergence of $f^{(t+1)}$, and stop if possible; otherwise, increment
t and return to step 1. A relative convergence criterion could be constructed
based on the total error, $\sum_{i=1}^{n} \|X_i - f^{(t+1)}(\tau^{(t+1)}(X_i))\|$.


The result of this algorithm is a piecewise linear polygonal curve that serves as 
the estimate of the principal curve. 
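
The four steps can be sketched compactly in code. This is an illustration under stated assumptions, not the reference implementation: it swaps in a constant-span running mean as the coordinatewise smoother, keeps the span fixed across iterations, takes the starting projection indices as an argument instead of computing the first principal component, and uses hypothetical function names throughout.

```python
import math
import random


def running_mean(order, values, span):
    """Constant-span running mean of `values` visited in the given order."""
    n = len(order)
    k = max(1, int(span * n) // 2)           # half-width in observations
    out = [0.0] * n
    for pos, i in enumerate(order):
        window = order[max(0, pos - k): pos + k + 1]
        out[i] = sum(values[j] for j in window) / len(window)
    return out


def project(x, vertices):
    """Arc-length position and distance of the closest point on a polyline."""
    best_tau, best_dist, start = 0.0, math.inf, 0.0
    for a, b in zip(vertices, vertices[1:]):
        d = [bj - aj for aj, bj in zip(a, b)]
        len2 = sum(dj * dj for dj in d)
        if len2 == 0.0:
            continue                          # skip repeated vertices
        u = sum((xj - aj) * dj for xj, aj, dj in zip(x, a, d)) / len2
        u = min(1.0, max(0.0, u))
        dist = math.dist(x, [aj + u * dj for aj, dj in zip(a, d)])
        if dist <= best_dist:
            best_tau, best_dist = start + u * math.sqrt(len2), dist
        start += math.sqrt(len2)
    return best_tau, best_dist


def principal_curve(points, tau, span=0.3, tol=1e-3, max_iter=50):
    """Iterate steps 1-4; returns (polyline vertices, tau, total error)."""
    n, p = len(points), len(points[0])
    prev_err = math.inf
    for _ in range(max_iter):
        order = sorted(range(n), key=lambda i: tau[i])
        # Step 1: smooth each coordinate against the projection index.
        coords = [running_mean(order, [x[k] for x in points], span)
                  for k in range(p)]
        # Step 2: join the fitted points, in order, by a polygonal curve.
        vertices = [tuple(coords[k][i] for k in range(p)) for i in order]
        # Steps 2-3: project onto the polyline; measuring position by arc
        # length along it gives the unit-speed projection index.
        proj = [project(x, vertices) for x in points]
        tau = [t for t, _ in proj]
        err = sum(d for _, d in proj)
        # Step 4: relative convergence criterion based on the total error.
        if abs(prev_err - err) <= tol * err:
            break
        prev_err = err
    return vertices, tau, err


# Simulated bivariate data scattered around a semicircle.
rng = random.Random(0)
angles = [rng.uniform(0, math.pi) for _ in range(100)]
points = [(math.cos(a) + rng.gauss(0, 0.05), math.sin(a) + rng.gauss(0, 0.05))
          for a in angles]

# Starting indices: projection onto the horizontal axis, which stands in
# here for the first principal component line.
tau0 = [x[0] for x in points]
vertices, tau, err = principal_curve(points, tau0)
print(len(vertices), err / len(points))
```

The running mean is used only because it is short to write; any of the predictor-response smoothers discussed earlier could take its place in step 1.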

The concept of principal curves can be generalized for multivariate responses. 
For this purpose, principal surfaces are defined analogously to the above. The surface 
is parameterized by a vector t, and data points are projected onto the surface. Points 
that project anywhere on the surface near t* dominate in the local smooth at t*. 


Example 12.7 (Principal Curve for Bivariate Data) Figure 12.12 illustrates
several steps during the iterative process of fitting a principal curve. The sequence of
panels should be read across the page from top left to bottom right. In the first panel,
the data are plotted. The solid line shaped like a square letter C is $f^{(0)}$. Each data point
is connected to $f^{(0)}$ by a line showing its orthogonal projection. As a bug walks along
$f^{(0)}(\tau)$ from the top right to the bottom right, $\tau^{(0)}(x)$ increases from 0 to about 7.
The second and third panels show each coordinate of the data plotted against the
projection index, $\tau^{(0)}(x)$. These coordinatewise smooths correspond to step 1 of the
estimation algorithm. A smoothing spline was used in each panel, and the resulting
overall estimate, $f^{(1)}$, is shown in the fourth panel. The fifth panel shows $f^{(2)}$. The
sixth panel gives the final result when convergence was achieved.

FIGURE 12.12 These panels illustrate the progression of the iterative fit of a principal curve.
See Example 12.7 for details.


12.2.1.3 Span Selection The principal curve algorithm depends on the selection
of a span $h^{(t)}$ at each iteration. Since the smoothing is done coordinatewise, different
spans could be used for each coordinate at each iteration, but in practice it is more
sensible to standardize the data before analysis and then use a common $h^{(t)}$.

Nevertheless, the selection of $h^{(t)}$ from one iteration to the next remains an
issue. The obvious solution is to select $h^{(t)}$ via cross-validation at each iteration.
Surprisingly, this doesn't work well. Pervasive undersmoothing arises because the
errors in the coordinate functions are autocorrelated. Instead, $h^{(t)} = h$ can be chosen
sensibly and remain unchanged until convergence is achieved. Then, one additional
iteration of step 1 can be done with a span chosen by cross-validation.

This span selection approach is troubling because the initial span choice clearly 
can affect the shape of the curve to which the algorithm converges. When the span 
is then cross-validated after convergence, it is too late to correct f for such an error. 
Nevertheless, the algorithm seems to work well on a variety of examples where 
ordinary smoothing techniques would fail catastrophically. 


PROBLEMS 


12.1. For A defined as in (12.5), smoothing matrices $S_k$, and n-vectors $\gamma_k$ for k = 1, ..., p,
let $\mathcal{I}_k$ be the space spanned by vectors that pass through $S_k$ unchanged (i.e., vectors
v satisfying $S_k v = v$). Prove that $A\gamma = 0$ (where $\gamma = (\gamma_1\ \gamma_2\ \ldots\ \gamma_p)^T$) if and only if
$\gamma_k \in \mathcal{I}_k$ for all k and $\sum_{k=1}^{p} \gamma_k = 0$.

TABLE 12.2 Potential predictors of body fat. Predictors 4-13 are circumference
measurements given in centimeters.

 1. Age (years)         8. Thigh
 2. Weight (pounds)     9. Knee
 3. Height (inches)    10. Ankle
 4. Neck               11. Extended biceps
 5. Chest              12. Forearm
 6. Abdomen            13. Wrist
 7. Hip

12.2. Accurate measurement of body fat can be expensive and time consuming. Good mod- 
els to predict body fat accurately using standard measurements are still useful in many 
contexts. A study was conducted to predict body fat using 13 simple body measure- 
ments on 251 men. For each subject, the percentage of body fat as measured using 
an underwater weighing technique, age, weight, height, and 10 body circumference 
measurements were recorded (Table 12.2). Further details on this study are available 
in [331, 354]. These data are available from the website for this book. The goal of this 
problem is to compare and contrast several multivariate smoothing methods applied 
to these data. 


a. Using a smoother of your own choosing, develop a backfitting algorithm to fit an 
additive model to these data as described in Section 12.1.1. Compare the results of 
the additive model with those from a multiple regression model. 


b. Use any available software to estimate models for these data, using five methods: (1)
the standard multiple linear regression (MLR) model, (2) an additive model (AM),
(3) projection pursuit regression (PPR), (4) the alternating conditional
expectations (ACE) procedure, and (5) the additivity and variance stabilization (AVAS)
approach.


i. For MLR, AM, ACE, and AVAS, plot the kth estimated coordinate smooth
against the observed values of the kth predictor for k = 1, ..., 13. In other
words, graph the values of $\hat{s}_k(x_{ik})$ versus $x_{ik}$ for i = 1, ..., 251 as in Figure 12.2.
For PPR, imitate Figure 12.6 by plotting each component smooth against the
projection coordinate. For all methods, include the observed data points in
an appropriate way in each graph. Comment on any differences between the
methods.


ii. Carry out leave-one-out cross-validation analyses where the ith cross-validated
residual is computed as the difference between the ith observed response and
the ith predicted response obtained when the model is fitted omitting the ith data
point from the dataset. Use these results to compare the predictive performance
of MLR, AM, and PPR using a cross-validated residual sum of squares similar
to (11.16).


12.3. For the body fat data of Problem 12.2, compare the performance of at least three 
different smoothers used within an additive model of the form given in (12.3). Compare 
the leave-one-out cross-validation mean squared prediction error for the different 
smoothers. Is one smoother superior to another in the additive model? 


12.4. Example 2.5 describes a generalized linear model for data derived from testing an
algorithm for human face recognition. The data are available from the book website.
The response variable is binary, with $Y_i = 1$ if two images of the same person were
correctly matched and $Y_i = 0$ otherwise. There are three predictor variables. The first
is the absolute difference in eye region mean pixel intensity between the two images
of the ith person. The second is the absolute difference in nose-cheek region mean
pixel intensity between the two images. The third predictor compares pixel intensity
variability between the two images. For each image of the ith person, the median
absolute deviation (a robust spread measure) of pixel intensity is computed in two
image areas: the forehead and nose-cheek regions. The third predictor is the
between-image ratio of these within-image ratios. Fit a generalized additive model to these
data. Plot your results and interpret. Compare your results with the fit of an ordinary
logistic regression model.


12.5. Consider a larger set of stream monitoring predictors of the index of biotic integrity
for macroinvertebrates considered in Example 12.4. The 21 predictors, described in
more detail in the website for this book, can be grouped into four categories:


Site chemistry measures: Acid-neutralizing capacity, chloride, specific conductance,
total nitrogen, pH, total phosphorus, sulfate


Site habitat measures: Substrate diameter, percent fast water, canopy density above
midchannel, channel slope


Site geographic measures: Elevation, longitude, latitude, mean slope above site


Watershed measures: Watershed area above site; human population density;
percentages of agricultural, mining, forest, and urban land cover


a. Construct a regression tree to predict the IBI.


b. Compare the performance of several strategies for tree pruning. Compare the
10-fold cross-validated mean squared prediction errors for the final trees selected
by each strategy.


c. The variables are categorized above into four groups. Create a regression tree
using only the variables from each group in turn. Compare the 10-fold
cross-validated mean squared prediction errors for the final trees selected for each group
of predictors.


12.6. Discuss how the combinatorial optimization methods from Chapter 3 might be used
to improve tree-based methods.


12.7. Find an example for which $X = f(\tau) + \epsilon$, where $\epsilon$ is a random vector with mean zero,
but f is not a principal curve for X.


12.8. The website for this book provides some artificial data suitable for fitting a principal
curve. There are 50 observations of a bivariate variable, and each coordinate has been
standardized. Denote these data as $x_1, \ldots, x_{50}$.


a. Plot the data. Let $f^{(0)}$ correspond to the segment of the line through the origin with
slope 1 onto which the data project. Superimpose this line on the graph. Imitating
the top left panel of Figure 12.12, show how the data project onto $f^{(0)}$.


b. Compute $\tau^{(0)}(x_i)$ for each data point $x_i$. Transform to unit speed. Hint: Show why
the transformation $a^T x_i$ works, where $a = (\sqrt{2}/2, \sqrt{2}/2)^T$.


c. For each coordinate of the data in turn, plot the data values for that coordinate (i.e.,
the $x_{ik}$ values for i = 1, ..., 50 and k = 1 or k = 2) against the projection index
values, $\tau^{(0)}(x_i)$. Smooth the points in each plot, and superimpose the smooth on
each graph. This mimics the center and right top panels of Figure 12.12.


d. Superimpose $f^{(1)}$ over a scatterplot of the data, as in the bottom left panel of
Figure 12.12.


e. Advanced readers may consider automating and extending these steps to produce
an iterative algorithm whose iterates converge to the estimated principal curve.
Some related software for fitting principal curves is available as packages for R
(www.r-project.org).
Computational Statistics, Second Edition. Geof H. Givens and Jennifer A. Hoeting. 
© 2013 John Wiley & Sons, Inc. Published 2013 by John Wiley & Sons, Inc. 


457 


458 INDEX 


bandwidth, 327, 329-339, 347-351 
for smoothing, see smoothing, bandwidth, 
374 
optimal, 332, 335 
baseball salaries, 67—68, 73—74, 78, 91—94, 
255-256 
batch method, 227 
batch times 
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Bernoulli distribution 
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simulation, 154 
beta distribution 
definition, 6 
simulation, 154 
Beverton—Holt model, 320 
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343 
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binomial distribution 
definition, 5 
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block bootstrap, 304-315 
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block size 
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BOA software, 226 
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bond variable, 275 
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aggregating, see bagging, 317 
antithetic, 302-303 
asymptotics, 315-316 
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Bayesian, 317 
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block, 304-315 
blocks of blocks, 307—309 
bumping, 317 
centering, 309-311 
circular block, 311 
confidence interval, 292-301, 315 
consistency, 315 
dependent wild, 316 
for ARQ models, 304 
for dependent data, 303-316 
for EM algorithm, 106, 110 
for independent data, 288-303, 315-316 
for regression, 290-291 
for smoothers, 384—388 
hypothesis testing, 301-302 
inferential methods, 292—302 
likelihood, 316 
moving block, 306-307 
nested, 292, 299-301, 315 
nonmoving block, 304-306 
paired, 291 
parametric, 289-290 
percentile method, 292-294, 385 
permutation, 302 
pivot, 293 
pivoting, 294-302, 309-311 
pseudo-data, 287 
regression 
cases, 291 
residuals, 290 
sieve, 316 
stationary block, 311 


studentized, see bootstrap, t, 296, 
309-311 
t, 296-298, 315 
tapered block, 316 
transformation-respecting, 294, 295 
umbrella of model parameters, see 
bumping, 317 
variance reduction, 302—303 
variance-stabilizing transformation, 293, 
298-299 
weighted likelihood, 317 
bootstrap filter, 179-180 
bounded convergence 
for adaptive MCMC, 238, 240, 245 
bowhead whales, 334—335, 337, 339, 344 
Box—Cox transformation, 353 
bracketing methods, 26 
breast cancer, 231—233, 320-321 
bridge sampling, 167-168 
BUGS software, 226 
bumping, 317 
burn-in, 220-222, 226 


call option, 191 
cancer, 320-321 
capture—recapture, 212-214, 217, 228-230 
carrying capacity, 240 
CART, see tree-based methods, 405 
Cauchy distribution 
definition, 6 
simulation, 154 
censored data, 54—56, 107—108, 112, 
123-124, 231-233 
central limit theorem, 14 
CFTP, see coupling from the past, 264 
chi-square distribution 
definition, 6 
simulation, 154 
circular block bootstrap, 311 
clinical trial, 55, 231-233, 317-318, 
320-321 
coal-mining disasters, 196-197, 233 
CODA software, 226 
Colorado topography, 176 
combinatorial optimization, 59—92 
candidate solution, 59 
genetic algorithm, 75-85 
globally competitive solution, 65 
local search, 65—68 
particle swarm, 85 
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problem complexity, 59-61 
simulated annealing, 68-75 
steepest ascent, 66 
tabu algorithm, 85-91 
traveling salesman problem, 59, 64, 70, 
79, 82-84 
complexity, 59-61 
composite rule, 129 
confidence bands, 384—388 
confidence interval 
bootstrap, 292-301, 315 
conjugate prior distribution, 12 
consistent estimator, 14 
constant-span running-mean smoother, 
366-372 
containment, see bounded convergence for 
adaptive MCMC, 238 
contraction 
for Nelder—Mead method, 47 
contractive mapping, 32-33 
control variates, 189-193 
improvement to importance sampling, 
190-191 
convergence almost surely, 13 
convergence criterion 
absolute, 131 
relative, 131 
convergence in probability, 13 
convergence order, 2 
convex function, 4 
cooling schedule, 71-72, 75 
copper-nickel alloy, 291-293, 296-297, 300 
cost—complexity pruning, 410 
coupling from the past, 264-268, 277-279 
for Markov random fields, 277—279 
credible interval, 12 
cross-validation, 332—335, 369-372, 377 
for smoothing, see smoothing, 
cross-validation, 369 
for tree-based methods, 410 
crossover, 62 
cubic smoothing spline, 341, 376 
curse of dimensionality, 152, 345, 393 
curve fitting, see smoothing, 363 
cusum diagnostic, 219 
CVRSS, see residual sum of squares, 
cross-validated, 370 
cycle, 210 
cyclic coordinate ascent, see Gauss-Seidel 
iteration, 53 
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Darwinian natural selection, 75 

decoupling, 276 

degeneracy, 171-173, 179 

delta method, 297—298 

density dependence, 241 

density estimation, 325-359, 372 
adaptive kernel, 348 
asymptotic mean integrated squared error, 

331-332, 339, 347 

balloon, 349 
bandwidth, 327, 329-339, 347-351 
bias—variance trade-off, 328—329, 340 
biased cross-validation, 334 
choice of kernel, 339-341 
cross-validation, 332-335 
exploratory projection pursuit, 353-359 
integrated squared error, 326-327, 333 


Bernoulli, 5 

beta, 6 

binomial, 5 

Cauchy, 6 
chi-square, 6 
Dirichlet, 6 
exponential, 6 
gamma, 6 
lognormal, 7 
multinomial, 5 
multivariate normal, 7 
negative binomial, 5 
normal, 7 

Poisson, 5 

Student’s t, 7 
uniform, 7 

Weibull, 7 


kernel, 327—341, 346-348, 375 

logspline, 341-345 

maximal smoothing principle, 338-339, 
348 

mean integrated squared error, 326-327, 
330-332 

mean squared error, 327 


double bootstrap, see nested bootstrap, 299 
drug abuse, 399 


earthquakes, 321 

ECM algorithm, 113-116 
edge-recombination crossover, 83-84 
effective sample size 


multivariate, 345-359 
nearest neighbor, 349-350 
plug-in methods, 335-337, 353 
product kernel, 375 
pseudo-likelihood, 333, 334 
Sheather—Jones method, 336—337 
Silverman’s rule of thumb, 335-336, 347, 
353 
transformation, 352—353 
unbiased cross-validation, 333—334, 
360-361 
univariate, 325-345 
variable-kernel, 348, 350-353 
dependent wild bootstrap, 316 
derivative-free method, 45 
detailed balance, 16, 251 
differentiation 
numerical, 3, 110-111 
diminishing adaptation 
for adaptive MCMC, 238, 240, 245, 
247-249 
Dirichlet distribution 
definition, 6 
simulation, 154 
discrete Newton methods, 41 
distributions 


for importance sampling, 182 

for Markov chain Monte Carlo, 224—225 

for sequential importance sampling, 
171-173 


EM algorithm, 97-121, 257 


acceleration methods, 118—121 
Aitken acceleration, 118-119 
ascent property, 103 

bootstrapping, 106, 110 
conditional maximization, 113-116 
convergence, 102-104 

E step, 98, 105, 111-112 

ECM, 113-116 

empirical information, 110-111 

for exponential families, 105-106 
generalized, 103, 113 

gradient EM, 116-118 

latent data, 97 

Louis’s method, 106—108 

M step, 98, 105, 112-118 

MCEM, 111-112 

missing data, 97-98 

missing information principle, 106, 108 
Monte Carlo, 111-112 

numerical differentiation of l’ (0), 111 
Q function, 98, 102, 106 


quasi-Newton acceleration, 119-121 
SEM, 106, 108-110 
supplemented, 106, 108-110 
variance estimates, 106-111 
empirical information, 111 
envelope 
for importance sampling, 163, 165, 
181-182 
for rejection sampling, 155-157 
for sequential importance sampling, 
171 
Epanechnikov kernel, 339-340 
equivalent degrees of freedom, 378 
equivalent kernels, 377-379 
ergodic Markov chain, 16, 265 
ergodic theorem, 16 
Euler—Maclaurin formula, 3, 139 
European option, 191 
evolution, see peppered moths, 99 
expanding pointwise confidence bands, 
386 
expansion 
for Nelder—Mead method, 46 
expectation—maximization algorithm, see 
EM algorithm, 97 
expected Fisher information, 10 
exploratory projection pursuit, 353-359 
exponential distribution 
definition, 6 
simulation, 154 
exponential family, 8, 36, 397 
EM algorithm, 105-106 


face recognition, 37, 417-418 
feed-forward neural network, 402-403 
finite differences, 3 
Fisher information, 10 
expected, 10 
observed, 10 
Fisher scoring, 30, 34-39, 398 
fitness, 76, 80-81 
fixed-point iteration, 32-33, 41 
convergence, 32-33 
scaled, 33 
flour beetle, 56—57 
functional, 1, 287 
functional iteration, see fixed-point iteration, 
33 
fundamental polynomials, 134 
fur seal pups, 212-214, 217, 228-230 
GAM, see generalized additive model, 397
gamma distribution 
definition, 6 
simulation, 154, 157 
gamma function, 8 
Gauss-Hermite quadrature, 145
Gauss-Legendre quadrature, 145, 149
Gauss-Newton method, 38, 44-45
Gauss-Seidel iteration, 52-53, 114, 394
Gaussian quadrature, 142-146 
approximation error, 144 
GDP, 305-307 
gear couplings, 123-124 
Gelman-Rubin statistic, 221-222
GEM algorithm, see EM algorithm, generalized, 103
generalized additive model, 397-399 
additive predictor, 397 
local scoring, 398 
generalized cross-validation 
smoothing, 372, see smoothing, generalized cross-validation, 377
generalized EM, see EM algorithm, generalized, 103
generalized linear mixed model, 131-134 
generalized linear model, 36-37, 397, 399
link function, 397 
genetic algorithm, 75-85 
allele, 75, 78-79, 82 
binary encoding, 79 
chromosome, 75 
convergence, 84-85
crossover, 76-77, 82-84 
edge recombination, 83-84 
fitness, 76, 80-81 
generation, 76, 79, 81, 84 
genetic operators, 76-77, 82-84 
genotype, 76 
locus, 75 
mutation, 77, 80, 84 
offspring, 76 
order crossover, 82 
parent, 76 
permutation chromosome, 79, 82-84
phenotype, 76 
scaled fitness, 93 
schema, 76, 84-85 
selection mechanism, 76, 80-81 
steady-state, 81 
tournament selection, 81 
uniform crossover, 94 
genetic map distance, 62 
genetic mapping, 61-64, 88, 94-95
Gibbs sampling, 209-218, 238, 257, 270-274
blocked, 216 
cycle, 210 
diagnostics, see Markov chain Monte Carlo, diagnostics, 218
for Markov random fields, 270-274
griddy, 218 
hybrid, 216-217, 242 
random scan, 216, 248 
relationship to Metropolis-Hastings, 215
reparameterization, 223-224 
globally competitive solution, 65 


gradient, 1 
gradient EM, see EM algorithm, gradient EM, 116


greedy algorithm, 66, 68, 344, 408, 409, 412 
griddy-Gibbs sampling, 218


hat matrix, 373, 376 

Hermite polynomials, 144 

Hessian matrix, 1

hidden Markov model, 124-126, 175, 176
hierarchical centering, 223, 233-235 
highest posterior density region, 12 
Himmelblau’s function, 57 

hit-and-run algorithm, 260-261 

HIV, 121-123 

hormone treatment, 231-233

HPD region, 12 

human face recognition, 37, 417-418 
hybrid Markov chain Monte Carlo, 216-217 


i.i.d., 9 
image analysis, 269-279 
importance ratio, 181, 204, 258 
importance sampling, 163-168, 180-186
adaptive, 167
choice of envelope, 181-182
compared with sampling importance resampling, 183
control variate improvement, 190-191 
effective sample size, 182 
envelope, 163, 165, 181 
importance ratio, 181, 204 


sequential, 169-179 
standardized importance weights, 164, 181-183
unstandardized importance weights, 181-183, 258
importance sampling function, 163, 165, 181-182
improper prior distribution, 12 
independence chain, 204-206
industrialized countries, 305-307
infrared emissions, 359-360
inner product, 143 
integrated squared error, 326-327, 333 
integration 
Monte Carlo, see Monte Carlo integration, 151
numerical, see numerical integration, 129 
internal node, 406 
interpolating polynomials, 130, 134 
inverse cumulative distribution function method, 153-155
IRLS, see iteratively reweighted least squares, 36
irreducible Markov chain, 16, 202 
ISE, see integrated squared error, 326 
iterated bootstrap, see nested bootstrap, 299 
iterated conditional modes, 114 
iteratively reweighted least squares, 36-38, 397


Jacobi polynomials, 144 
Jacobian matrix, 1, 9 
Jeffreys prior, 12, 213, 217 
Jensen’s inequality, 4, 103 


k-change, 65 

kernel, 327 
adaptive, 348 
asymptotic relative efficiency, 340 
biweight, 339, 380 
canonical, 340-341 
Epanechnikov, 339-340 
normal, 339 
product, 345, 346 
rescaling, 340-341 
triangle, 339 
tricube, 379 
triweight, 339 
uniform, 328, 339 
variable, 348, 350-353 


kernel density estimation, see density estimation, kernel, 327, 375
kernel smoother, 374, 378 
knot 
logspline, 341 


Laguerre polynomials, 144 
Langevin Metropolis-Hastings, 262-263
latent data, 97 
least squares cross-validation, 333 
leave-one-out, see cross-validation, 369 
Legendre polynomials, 144, 354, 357 
likelihood function, 9 
profile, 10-11, 59, 63 
line search methods, 40 
linear regression, see regression, 64 
link function, 36, 397 
linkage, 62 
Lipschitz condition, 33 
“little oh” notation, 2 
local averaging, 365 
local regression smoother, 373-376 
local scoring, 398 
local search, 65-68
random starts, 66 
tabu algorithm, 85-91 
variable-depth, 66 
locally dependent Markov random field, 269 
locally weighted regression smoother, see local regression smoother, 374
locus, 61 
loess smoother, 379-381 
logistic growth model, 56-57 
logistic regression, 36-38 
lognormal distribution 
definition, 7 
simulation, 154 
logspline density estimation, 341-345 
Louis’s method, 106-108 


macroinvertebrates, 405-406, 408-411, 418
majorization, 104 
map distance, 62 
Maple software, 148 
mark-recapture, 212-214, 217, 228-230 
Markov chain, 14-17, 71

aperiodic, 16, 202 

convergence, 201 

coupling, 264 

detailed balance, 16, 251 
ergodic, 16, 265

irreducible, 16, 202 

nonnull state, 16 

recurrent state, 16 

reversible, 16, 251 

state space, 14 

states, 14 

stationary distribution, 16, 202-203 

time-homogeneous, 15 

Markov chain Monte Carlo, 201-230, 237-279, 325

acceptance rate, 222, 238, 240, 245 

adaptive, 237-249 

adaptive Metropolis algorithm, 247-249 

auxiliary variable methods, 251, 256-260, 274-277

batch method, 227 

Bayesian estimation, 204, 212-214, 217, 228-230

burn-in, 220-222 

convergence, 201, 218-224 

coupling from the past, 264-268, 277-279 

cusum diagnostic, 219 

diagnostics, see burn-in, convergence, mixing, number of chains, run length, 218

effective sample size, 224-225 

for Markov random fields, 269-279 

Gelman-Rubin statistic, 221-222 

Gibbs sampling, 209-218, 270-274 

hierarchical centering, 223, 233-235 

hit-and-run algorithm, 260-261 

hybrid strategies, 216-217, 242 

image analysis, 269-279 

independence chain, 204-206 

Langevin Metropolis-Hastings, 262-263

maximum likelihood, 268-269

Metropolis-Hastings, 71, 202-209

Metropolis-Hastings ratio, 202, 204, 261

Metropolis-within-Gibbs, 217, 238, 240 

mixing, 205, 219-224, 259, 276-277 

Monte Carlo standard error, 227 

multiple-try Metropolis-Hastings, 261-262

number of chains, 225-226

perfect sampling, 260, 264-268, 277-279

proposal distribution, 202-203, 218, 222-223

random walk chain, 206-209, 228 

reparameterization, 207-208, 223-224 
reversible jump, 250-256 
run length, 226 
sample path, 205, 207, 219 
simulated tempering, 257-258 
slice sampling, 258-260 
software, 226 
structured Markov chain Monte Carlo, 216
Swendsen-Wang algorithm, 274-277
target distribution, 201, 257 
Markov process, 169-170 
Markov random field, 269-279 
auxiliary variable methods, 274-277 
Gibbs sampling, 270-274 
locally dependent, 269 
perfect sampling, 277-279 
MARS, 402 
Martian atmosphere, 390 
Mathematica software, 148 
MATLAB language, xvi, 17 
maximal smoothing principle, 338-339, 348 
maximum likelihood, 9-10, 21-22, 25, 27, 33, 36, 64, 97, 106, 114, 133, 268-269, 342
MCEM algorithm, see Monte Carlo EM algorithm, 111
MCMC, see Markov chain Monte Carlo, 201
mean integrated squared error, 326-327, 330-332
mean squared error, 183, 327 
of estimation, 363 
of prediction, 364 
Metropolis-Hastings algorithm, 71, 202-209, 257
acceptance rate, 222
diagnostics, see Markov chain Monte Carlo, diagnostics, 218
multiple-try, 261-262
relationship to Gibbs sampling, 215
reparameterization, 207-208, 223-224
Metropolis-Hastings ratio, 202, 204, 252, 257, 261
generalized, 262
minorization, 104
MISE, see mean integrated squared error, 326
missing data, 97-98
missing information principle, 106, 108 
mixing, 205, 219-224, 259, 276-277 


mixture distribution, 205-209, 220
MLE, 9 
modified Newton method, 40 
Monte Carlo EM algorithm, 111-112 
Monte Carlo integration, 107, 151-152, 180-195, 203, 226-227
antithetic sampling, 186-189 
control variates, 189-193 
importance sampling, 180-186 
Markov chain Monte Carlo, see Markov chain Monte Carlo, 201, 203, 226-227
Rao-Blackwellization, 193-195 
Riemann sum improvement, 195-196 
variance reduction, 180-195 
Monte Carlo maximum likelihood, 268 
moving average, 366 
moving block bootstrap, 306-307 
MSE, see mean squared error, 327 
MSPE, see mean squared error, of prediction, 364
multinomial coefficient, 8 
multinomial distribution 
definition, 5 
simulation, 154 
multiple integrals, 147 
multiple-try Metropolis-Hastings algorithm, 261-262
multivariate adaptive regression splines, 402 
multivariate normal distribution 
definition, 7 
simulation, 154 


Nadaraya-Watson estimator, 375-376
natural selection, 75
navigation, see terrain navigation, 176
nearest neighbor density estimation, 349-350
negative binomial distribution 
definition, 5 
simulation, 154 
neighborhood 
for local search, 65 
for Nelder-Mead method, 45
for simulated annealing, 70, 74 
for smoothing, 365 
for tabu algorithm, 86 
k-change, 65 
Nelder-Mead algorithm, 45-52
nested bootstrap, 292, 299-301 


network failure, 184-186, 189 
neural network, 402-403 
Newton’s method, 26-29, 34-37, 39, 118
convergence, 27-30 
discrete, 41 
modified, 40 
Newton-Côtes quadrature, 129-139 
Newton-like methods, 39-44 
backtracking, 39-40, 42 
node 
for numerical integration, 129 
for tree-based methods, 406 
nonmoving block bootstrap, 304-306 
nonlinear equations, solving, see optimization, 21
nonlinear least squares, 44
nonnull state, 16
nonparametric bootstrap, see bootstrap, 287
nonparametric density estimation, see density estimation, 325
nonparametric regression, see smoothing, 363
normal distribution 
definition, 7 
simulation, 154 
normal kernel, 339 
normalizing constant, 11, 156, 167, 204 
Norwegian paper, 396-397, 401 
notation, 1, 4 
NP problem, 61 
NP-complete problem, 61 
NP-hard problem, 61 
numerical differentiation, 3, 110-111 
numerical integration, 129-148, 152, 298 
nth-degree rule, 138-139 
adaptive quadrature, 147 
composite rule, 129 
Gauss-Hermite quadrature, 145
Gauss-Legendre quadrature, 145, 149
Gaussian quadrature, 142-146
approximation error, 144
method of undetermined coefficients, 138-139
multiple integrals, 147
Newton-Côtes quadrature, 129-139
node, 129 
over infinite range, 146 
product formulas, 147 
Riemann rule, 130-134 
Romberg integration, 139-142
approximation error, 140 

simple rule, 129 

Simpson’s rule, 136-138 
adjoining subintervals, 137 
approximation error, 138 

singularities, 146 

software, 148 

transformations, 146 

trapezoidal rule, 134-136
approximation error, 136 


O() notation, 2

o() notation, 2

observed Fisher information, 10 

oil spills, 56 

optimization, 21-54 
absolute convergence criterion, 24, 34 
ascent algorithm, 39-40
backfitting, 52-53, 394
backtracking, 39-40, 42
BFGS, 42 
bisection, 23-25, 30 
bracketing methods, 26 
combinatorial, see combinatorial optimization, 59
convergence criteria, 24-25, 34, 49 
derivative-free, 45 
discrete Newton methods, 41 
EM algorithm, see EM algorithm, 97 
Fisher scoring, 30, 34-39 
fixed-point iteration, 32-33, 41 
Gauss-Newton, 38, 44-45
Gauss-Seidel iteration, 52-53, 114, 394
iterated conditional modes, 114
iteratively reweighted least squares, 36-38, 397

majorization, 104 
minorization, 104 
multivariate, 34-54 
Nelder-Mead algorithm, 45-52
Newton’s method, 26-29, 34-37, 39, 118
Newton-like methods, 39-44 
order of convergence, 29-30 
quasi-Newton, 41-44, 119 
relative convergence criterion, 25, 34 
scaled fixed-point iteration, 33 
secant method, 30-32 
starting value, 22, 26 
steepest ascent, 39 
stopping rule, 24-26 
univariate, 22-33

optimization transfer, 104 

option pricing, 191-193, 197-198 

order crossover, 82 

order of convergence, 2 

orthogonal polynomials, 143 

orthonormal polynomials, 143 


paper manufacture, 396-397, 401 
parallel chords, method of, see fixed-point iteration, 33
parametric bootstrap, see bootstrap, parametric, 289
parent node, 406 
particle filter, 179-180 
particle swarm, 85 
path sampling, 167-168 
peppered moths, 99-101, 105-106, 109-110, 117-118, 120-121
percentile method, 292-294, 385
perfect sampling, 260, 264-268, 277-279
for Markov random fields, 277-279
sandwiching, 267-268, 277-279
permutation bootstrap, see balanced bootstrap, 302
permutation test, 317-319 
pigment moisture content, 234-235 
pivotal quantity, 293-301, 387 
plug-in methods, 335-337, 353 
pointwise confidence band, 384 
Poisson distribution 
definition, 5 
simulation, 154 
polynomial algorithm, 60 
polynomials 
fundamental, 134 
Hermite, 144 
interpolating, 134 
Jacobi, 144 
Laguerre, 144 
Legendre, 144, 354, 357 
orthogonal, 143 
orthonormal, 143 
population dynamics, 240-247 
carrying capacity, 240 
density dependence, 241 
population modeling, 56-57, 240-247, 289, 319-320


positive definite matrix, 1 
positive semidefinite matrix, 1 
positivity, 269 
posterior distribution, 11 
predictor, 363 
predictor-response data, 363
pressure of air blast, 390-391 
principal curves smoother, 389, 413-416 
projection index, 414 
software, 413 
span selection, 416 
principal surfaces smoother, 415 
prior distribution, 11 
conjugate, 12 
improper, 12 
Jeffreys, 12 
probability integral transform, 153 
problem complexity, 59-61 
product formulas, 147 
product kernel, 345, 346, 375 
profile likelihood, 10-11, 59, 63 
projection index, 414 
projection pursuit density estimation, see exploratory projection pursuit, 353
projection pursuit regression, 399-402
proposal distribution
for Markov chain Monte Carlo, 202-203, 218, 222-223
for simulated annealing, 69, 70, 74 
pruning, 409-411 
pseudo-data, 287 
pseudo-likelihood, 333, 334 


quadrature, see numerical integration, 129
quasi-Newton acceleration
for EM algorithm, 119-121
quasi-Newton methods, 41-44, 119
BFGS, 42 


R language, xvi, 17, 226, 344 

random ascent, 66 

random starts local search, 66 

random walk chain, 206-209 

randomization test, 318 

Rao-Blackwellization, 193-195, 227
improvement of rejection sampling, 194-195
recombination, 62 
recurrent state, 16 


recursive partitioning regression, see tree-based methods, 405
reflection
for Nelder-Mead method, 46
regression 
bootstrapping, 290-291 
cases, 291 
paired, 291 
residuals, 290 
logistic, 36-38 
recursive partitioning, see tree-based methods, 405
variable selection, 64, 67-68, 73-74, 78, 87, 91, 253-256
with missing data, 114-116 
rejection sampling, 155-162 
adaptive, 159-162 
envelope, 155-157 
Rao-Blackwellization improvement, 194-195
squeezed, 158-159 
rejuvenation, 173 
relative convergence criterion, 25, 34 
residual sum of squares, 369 
cross-validated, 370 
response, 363 
reversible jump methods, 250-256 
reversible Markov chain, 16, 251 
Richardson extrapolation, 142 
Riemann rule, 130-134 
RJMCMC, see Markov chain Monte Carlo, reversible jump, 250
Romberg integration, 139-142 
approximation error, 140 
root node, 406 
roughness, 330 
RSS, see residual sum of squares, 370 
running-line smoother, 372-374 
running-polynomial smoother, 373 


S-Plus language, 68 
salmon population, 319-320 
sample path, 205, 207, 219 
sample point adaptive estimator, 350 
sampling importance resampling, 163-167 
adaptive, 167 
choice of envelope, 165 
compared with importance sampling, 183 
standardized importance weights, 164 
scatterplot smoothing, see smoothing, 363 
score equation, 21
score function, 10 
secant condition, 41, 119 
secant method, 30-32 
convergence, 31-32 
self-avoiding walk, 198 
self-consistency, 414 
SEM algorithm, 106, 108-110 
sensitivity analysis, 184, 233 
sequential importance sampling, 169-179 
degeneracy, 171-173 
effective sample size, 171-173 
envelope, 171 
for Markov processes, 169-170 
rejuvenation, 173 
with resampling, 173, 175, 179 
sequential Monte Carlo, 168-180
sexual histories, 121-123
Sheather-Jones method, 336-337
shrink transformation
for Nelder-Mead method, 47
sieve block bootstrap, 316
Silverman’s rule of thumb, 335-336, 347, 353
simple rule for integration, 129 
simplex 
for Nelder-Mead method, 45
Simpson’s rule, 136-138 
adjoining subintervals, 137 
approximation error, 138 
Romberg improvement of, 142 
simulated annealing, 68-75, 258 
as a Markov chain, 71 
constrained solution space, 74 
convergence, 71-72
cooling schedule, 71-72, 75 
neighborhoods, 70, 74 
proposal distribution, 69, 70, 74 
temperature, 69 
simulated tempering, 257-258 
simulation, 151-180
adaptive importance sampling, 167 
adaptive rejection sampling, 159-162 
approximate, 163-180 
bridge sampling, 167-168 
exact, 152-162
importance sampling, 163-168
inverse cumulative distribution function, 153-155
path sampling, 167-168 
rejection sampling, 155-162
sampling importance resampling, 163-167
sequential Monte Carlo, 168-180
squeezed rejection sampling, 158-159 
standard distributions, 153 
standard parametric distributions, 154 
target distribution, 152 
uniform distribution, 153 
SIR, see sampling importance resampling, 163
slash distribution, 165-166
slice sampling, 258-260 
smooth function model, 310 
smoothing, 298, 363-416
additive model, 394-397
additivity and variance stabilization, 404-405
alternating conditional expectations, 403-404
bandwidth, 365, 374
confidence bands, 384-388
constant-span running mean, 366-372 
cross-validation, 369-372, 377 
equivalent degrees of freedom, 378 
equivalent kernels, 377-379 
expanding confidence bands, 386 
generalized additive model, 397-399 
local scoring, 398 
generalized cross-validation, 372, 377 
kernel, 374, 376, 378 
linear, 365-379 
local averaging, 365 
local regression, 373-376
locally weighted regression, see smoothing, local regression, 374
loess, 379-381 
matrix, 366, 373 
mean squared estimation error, 363 
mean squared prediction error, 364 
Nadaraya-Watson estimator, 375-376
neighborhood, 365 
nonlinear, 379-384 
principal curves, 389, 413-416 
projection index, 414 
span selection, 416 
principal surfaces, 415 
projection pursuit regression, 399-402 
running lines, 372-374 


running polynomial, 373 
span, 365, 368-372, 377 
splines, 376-377 
supersmoother, 381-384 
variable-span, 381-384 
software, 17 
density estimation, 344 
for Markov chain Monte Carlo, 162, 226
numerical integration, 148 
principal curves, 413 
tree-based methods, 405 
variable selection, 68 
span, see smoothing, span, 365 
sphering, 347, 353 
spline smoother, 376-377 
split coordinate, 407 
split point, 407 
square-integrable, 143 
squeezed rejection sampling, 158-159 
squeezing function, 158 
standardized importance weights, 164 
state space, 14 
states, 14 
stationary block bootstrap, 311 
stationary distribution, 16, 202-203 
steady-state genetic algorithm, 81 
steepest ascent, 39, 66 
steepest ascent/mildest descent, 66 
step length 
definition, 39 
for backtracking, 39-40, 42 
stochastic monotonicity, 267-268, 277-279 
stomach cancer, 320-321 
stream ecology, 210-212 
stream monitoring, 405-406, 408-411, 418
strong law of large numbers, 13, 152 
structure index, 355 
structured Markov chain Monte Carlo, 216 
Student’s t distribution
definition, 7 
simulation, 154 
studentized bootstrap, see bootstrap, t, 296 
subinterval, 129 
subtree, 407 
sufficient descent, 52 
supersmoother, 381-384 
supplemented EM algorithm, see SEM algorithm, 108


survival analysis, 54-56, 123-124

Swendsen-Wang algorithm, 274-277
decoupling, 276 

symmetric nearest neighborhood, 366 


t distribution, see Student’s t distribution, 154
tabu algorithm, 85-91 
aspiration by influence, 89 
aspiration criteria, 88-89
diversification, 89-90
frequency, 89 
intensification, 90 
move attributes, 86, 88 
recency, 87 
tabu list, 87-88 
tabu tenure, 88 
tapered block bootstrap, 316 
target distribution, 152, 201, 257 
Taylor series, 2-3 
delta method, 297-298
for Gauss-Newton method, 44
for Newton’s method, 26, 28, 34 
for Simpson’s rule, 138 
for trapezoidal rule, 136 
Taylor’s theorem, 2-3 
terrain navigation, 176-180 
time-speed parameterization of curves, 413 
time-homogeneous Markov chain, 15 
tournament selection, 81 
tracking, see terrain navigation, 169 
transformation of random variables, 8-9
trapezoidal rule, 134-136 
approximation error, 136 
Romberg improvement of, 139 
traveling salesman problem, 59, 64, 70, 79, 82-84
tree rings, 308-309
tree-based methods, 405-413
classification, 411-412
model uncertainty, 412 
node 
internal, 406 
parent, 406 
root, 406 
pruning, 409-411 
software, 405 
split, 406 
split coordinate, 407 
split point, 407

subtree, 407 

tree, 405 
triangle kernel, 339 
tricube kernel, 379 
triweight kernel, 339 


UCV, see unbiased cross-validation, 333 
unbiased cross-validation, 333-334, 360-361
unbiased estimator, 14
undetermined coefficients, method of, 138-139
uniform distribution 
definition, 7 
simulation, 153, 154 
uniform kernel, 339 
unit-speed parameterization, 413 
Utah serviceberry, 271-273, 276, 278-279 


vanishing adaptation, see diminishing adaptation, 238
variable kernel, 348
variable selection, 64, 67-68, 73-74, 78, 87, 91, 253-256
variable-depth local search, 66 
variable-kernel density estimator, 350-353 
variable-metric method, 42 
variable-span smoother, 381-384 
variance reduction, 180-195 
antithetic sampling, 186-189 
control variates, 189-193
for bootstrap, 302-303
importance sampling, 180-186
Rao-Blackwellization, 193-195
Riemann sum improvement, 195-196
variance-stabilizing transformation, 293, 298-299


weak law of large numbers, 13 
website for this book, xvi 
Weibull distribution 

definition, 7 
weighted likelihood bootstrap, 317 
whale migration, 334-335, 337, 339, 344 
whale population dynamics, 240-247 
whitening, 347, 353 
WinBUGS software, 162 
wine chemistry, 95 